Preface



The papers in this volume were presented at the 4th International Conference on 
Large-Scale Scientific Computations ICLSSC 2003. It was held in Sozopol, Bul- 
garia, June 4-8, 2003. The conference was organized and sponsored by the Cen- 
tral Laboratory for Parallel Processing at the Bulgarian Academy of Sciences. 
Support was also provided from the Center of Excellence “BIS 21” (funded by 
the European Commission), SIAM and GAMM. A co-organizer of this tradi- 
tional scientific meeting was the Division of Numerical Analysis and Statistics 
of the University of Rousse. 

The success of the conference and the present volume in particular are the 
outcome of the joint efforts of many colleagues from various institutions and 
organizations. First, we thank all the members of the Scientific Committee for
their valuable contribution to forming the scientific face of the conference, as 
well as for their help in reviewing contributed papers. We would like to specially 
thank the organizers of the special sessions: R. Blaheta, N. Dimitrova, A. Ebel, 
K. Georgiev, O. Iliev, A. Karaivanova, H. Kosina, M. Krastanov, U. Langer, 
P. Minev, M. Neytcheva, M. Schafer, V. Veliov, and Z. Zlatev. We are also 
grateful to the staff involved in the local organization. 

Special Events:

The conference was devoted to the 60th anniversary of Raytcho Lazarov.

During the conference, the nomination for the World Level of the Hall of Fame for Engineering, Science and Technology, HOFEST, was officially awarded to Owe Axelsson.

Traditionally, the purpose of the conference is to bring together scientists 
working with large-scale computational models of environmental and industrial 
problems, and specialists in the field of numerical methods and algorithms for 
modern high-speed computers. The key lectures reviewed some of the advanced 
achievements in the field of numerical methods and their efficient applications. 
The ICLSSC 2003 talks were presented by university researchers and practical 
industry engineers, including applied mathematicians, numerical analysts and 
computer experts. The general theme for ICLSSC 2003 was Large-Scale Scientific 
Computing, focusing on: 

— Recent achievements in preconditioning techniques; 

— Monte Carlo and quasi-Monte Carlo methods; 

— Set-valued numerics and reliable computing;

— environmental modelling; 

— large-scale computations for engineering problems. 

More than 90 participants from all over the world attended the conference,
representing some of the strongest research groups in the field of advanced
large-scale scientific computing. This volume contains 55 papers submitted by authors
from over 20 countries, out of which 5 were invited and 50 were contributed.
The 5th International Conference LSSC 2005 is planned for June 2005.



November 2003

Ivan Lirkov
Svetozar Margenov
Jerzy Wasniewski
Plamen Yalamov

Eigenvalue Estimates
for Preconditioned Saddle Point Matrices



Owe Axelsson 

University of Nijmegen, Nijmegen, The Netherlands 
axelsson@math.kun.nl



Abstract. Eigenvalue bounds for saddle point matrices on symmetric 
or, more generally, nonsymmetric form are derived and applied for pre- 
conditioned versions of the matrices. The preconditioners enable efficient 
iterative solution of the corresponding linear systems. 



1 Introduction 

Matrices on saddle point form arise in constrained problems which in various 
forms occur in applications such as in constrained optimization, flow problems 
for incompressible materials, and domain decomposition methods, to name a 
few. For such large-scale problems, iterative methods must be used, and they
require efficient preconditioners.

We derive first bounds on the eigenvalues of symmetric forms of saddle point 
matrices, showing how they depend on the top matrix block and the Schur com- 
plement matrix. The bounds are then improved using a congruence transforma- 
tion of the matrix, which makes the off-diagonal matrices small. The transforma- 
tion is followed by a block-diagonal preconditioner. It is seen that the off-diagonal 
blocks have only a second order influence on the resulting eigenvalue bounds, 
and the eigenvalues depend mainly only on the preconditioned top matrix and 
negative Schur complement matrix, which, by assumption, both have positive 
eigenvalues. The eigenvalues of the resulting preconditioned matrix are real and
cluster around $-1$ and $+1$.

A corresponding form of transformation can be applied for nonsymmetric ma- 
trices. One source of nonsymmetry can be due to a shift of sign of the constrained 
equations and we show in this case how the eigenvalues of the preconditioned 
matrix cluster around the unit number in the complex plane. 

The preconditioned matrix problems can be solved using a generalized mini- 
mal residual form of the conjugate gradient method. For intervals symmetrically 
located around the origin, the effective condition number equals the square of the 
condition number for each of the two intervals, showing the importance of having 
preconditioned the matrices to have intervals with sufficiently small condition 
numbers. 

For problems where the matrix defining the constraint is (nearly) rank deficient
or, equivalently, the corresponding term in the Schur complement matrix has
zero or small eigenvalues, one can apply a regularization technique to stabilize
the smallest eigenvalue. Such methods have been used in [1, 3], for instance.

The present method is applicable also for nonsymmetric problems on saddle
point form. Here the top matrix block is strongly nonsymmetric when it originates
from a convection dominated convection diffusion problem, for instance.
For such problems it is important to use a discretization method which leads to
nearly M-matrices, enabling an efficient preconditioning, in particular if a proper
node ordering based on the directions of the flow field has been used. Examples
of such methods can be found in [4] and [5], for instance.

Some of the above estimates have previously been derived in [6]. See also 
other references cited there. In the present paper the estimates of eigenvalue 
bounds for the preconditioned matrices are derived from the general estimates 
for matrices on saddle point form. In particular, they generalize the methods 
presented in [6, 7] and elsewhere, where the preconditioning matrix itself has a 
saddle point form but is factorized in block triangular factors. 



2 Eigenvalues of Matrices on Saddle Point Form 



Given

$$A = \begin{bmatrix} M & B^T \\ B & -C \end{bmatrix},$$

where $M$ of order $n \times n$ is symmetric and positive definite, $C$ of order $m \times m$ is symmetric and it is assumed that the negative Schur complement matrix $S = C + BM^{-1}B^T$ is positive definite.

Note that if $C = 0$, the latter assumption implies that $m \le n$ and $B$ has full rank $= m$.

Let

$$0 < \mu_1 \le \mu_2 \le \dots \le \mu_n \quad \text{be the eigenvalues of } M,$$
$$0 \le \sigma_1 \le \sigma_2 \le \dots \le \sigma_m \quad \text{be the eigenvalues of } BM^{-1}B^T.$$

$A$ can be factored as

$$A = \begin{bmatrix} I_1 & 0 \\ BM^{-1} & I_2 \end{bmatrix} \begin{bmatrix} M & 0 \\ 0 & -S \end{bmatrix} \begin{bmatrix} I_1 & M^{-1}B^T \\ 0 & I_2 \end{bmatrix},$$

where $I_1$, $I_2$ are the corresponding unit matrices, that is, it can be written as a congruence transformation of

$$\begin{bmatrix} M & 0 \\ 0 & -S \end{bmatrix}.$$

Since, by assumption, the latter matrix is indefinite and since, by Sylvester's theorem, congruence transformations preserve
the signs of the eigenvalues, it follows that $A$ is indefinite and also nonsingular,
because both $M$ and $S$ are nonsingular. We shall compute intervals that contain
the eigenvalues of $A$. Since $A$ is symmetric and indefinite, its eigenvalues are
real and located on both sides of the origin.
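The factorization and the inertia argument can be checked numerically. The following is a minimal NumPy sketch on randomly generated test data; the sizes, the random seed and all variable names are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

# Hypothetical test data (not from the paper): M is SPD, B has full row rank,
# C is symmetric positive semidefinite.
rng = np.random.default_rng(0)
n, m = 6, 3
G = rng.standard_normal((n, n))
M = G @ G.T + n * np.eye(n)
B = rng.standard_normal((m, n))
H = rng.standard_normal((m, m))
C = 0.1 * (H @ H.T)

A = np.block([[M, B.T], [B, -C]])
M_inv = np.linalg.inv(M)
S = C + B @ M_inv @ B.T                      # negative Schur complement

# Block factors of the congruence A = L * diag(M, -S) * L^T
L = np.block([[np.eye(n), np.zeros((n, m))], [B @ M_inv, np.eye(m)]])
D = np.block([[M, np.zeros((n, m))], [np.zeros((m, n)), -S]])

factorization_error = np.linalg.norm(A - L @ D @ L.T)
eigenvalues = np.linalg.eigvalsh(A)
n_positive = int(np.sum(eigenvalues > 0))    # inertia of M: n positive eigenvalues
n_negative = int(np.sum(eigenvalues < 0))    # inertia of -S: m negative eigenvalues
print(factorization_error, n_positive, n_negative)
```

In this sketch the count of positive eigenvalues equals the order of $M$ and the count of negative eigenvalues equals the order of $S$, as Sylvester's theorem predicts.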

The next theorem shows bounds for the eigenvalue intervals. 



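Theorem 1. Let

$$A = \begin{bmatrix} M & B^T \\ B & -C \end{bmatrix},$$

where $M$ and $S = C + BM^{-1}B^T$ are symmetric and positive definite. Let $0 < \mu_1 \le \mu_2 \le \dots \le \mu_n$, $0 \le \sigma_1 \le \sigma_2 \le \dots \le \sigma_m$ be the eigenvalues of $M$ and $BM^{-1}B^T$, respectively, and let $\gamma^2 = \rho(S^{-1/2}BM^{-1}B^TS^{-1/2})$, the spectral radius. Then

(a) The eigenvalues $\lambda_i$ of $A$ are located in the two intervals

$$\left[ -\lambda_{\max}(S),\ \frac{-\lambda_{\min}(S)}{1 + \frac{\gamma^2}{\mu_1}\lambda_{\min}(S)} \right] \cup \left[ \mu_1,\ \mu_n + \sigma_m \right].$$

(b) If $C$ is positive semidefinite then the upper bound can be replaced by the more accurate bound

$$\mu_n \frac{1 + \sqrt{1 + 4\sigma_m/\mu_n}}{2}.$$

(c) If $C = 0$ and $B$ has full row rank, then

$$\lambda_i \in \left[ \frac{-\sigma_m}{\frac{1}{2}\left(1 + \sqrt{1 + 4\sigma_m/\mu_n}\right)},\ \frac{-\sigma_1}{\frac{1}{2}\left(1 + \sqrt{1 + 4\sigma_1/\mu_1}\right)} \right] \cup \left[ \mu_1 \frac{1 + \sqrt{1 + 4\sigma_1/\mu_1}}{2},\ \mu_n \frac{1 + \sqrt{1 + 4\sigma_m/\mu_n}}{2} \right].$$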

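Proof. Let $\lambda$, $\begin{bmatrix} x \\ y \end{bmatrix}$, $\|x\| + \|y\| \ne 0$ be an eigenpair of $A$, i.e., $A \begin{bmatrix} x \\ y \end{bmatrix} = \lambda \begin{bmatrix} x \\ y \end{bmatrix}$. We rewrite this eigenvalue problem in the form

$$\begin{cases} M^{1/2}x + M^{-1/2}B^Ty = \lambda M^{-1/2}x \\ BM^{-1/2}(M^{1/2}x) = (\lambda I_2 + C)y \end{cases}$$

or, with $\tilde B = BM^{-1/2}$, $\tilde x = M^{1/2}x$, in the form

$$\begin{cases} \tilde x + \tilde B^Ty = \lambda M^{-1}\tilde x \\ \tilde B\tilde x = (\lambda I_2 + C)y \end{cases} \tag{1}$$

This is further rewritten in the form

$$\begin{cases} \tilde x + \tilde B^Ty = \lambda M^{-1}\tilde x \\ (\lambda I_2 + S)y = \tilde B\tilde x + \tilde B\tilde B^Ty \end{cases}$$

or

$$\begin{cases} (\lambda M^{-1} - I_1)\tilde x = \tilde B^Ty \\ (\lambda I_2 + S)y = \lambda\tilde BM^{-1}\tilde x \end{cases} \tag{2}$$

Since $A$ is nonsingular, it holds that $\lambda \ne 0$.

Given an eigenpair, we consider the two cases,

(i) $\lambda > 0$, (ii) $\lambda < 0$ separately.

(i) $\lambda > 0$: In this case $(\lambda I_2 + S)$ is nonsingular and (2) reduces to

$$(\lambda M^{-1} - I_1)\tilde x = \lambda\tilde B^T(\lambda I_2 + S)^{-1}\tilde BM^{-1}\tilde x$$

or

$$(\lambda I_1 - M)\hat x = \lambda\tilde B^T(\lambda I_2 + S)^{-1}\tilde B\hat x,$$

where $\hat x = M^{-1}\tilde x$.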

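Hence

$$\lambda\hat x^T\hat x = \hat x^TM\hat x + \hat x^T\tilde B^T\left(I_2 + \tfrac{1}{\lambda}S\right)^{-1}\tilde B\hat x.$$

Since $0 < (I_2 + \frac{1}{\lambda}S)^{-1} < I_2$, it follows that

$$\mu_1 \le \lambda \le \mu_n + \rho(BM^{-1}B^T) = \mu_n + \sigma_m.$$

(ii) $\lambda < 0$: In this case $\lambda M^{-1} - I_1$ is nonsingular and (2) reduces to

$$(\lambda I_2 + S)y = \lambda\tilde BM^{-1}(\lambda M^{-1} - I_1)^{-1}\tilde B^Ty$$

or

$$(\lambda I_2 + S)y = -\lambda\tilde B(M - \lambda I_1)^{-1}\tilde B^Ty,$$

so

$$(-\lambda)\,y^T\left(I_2 + \tilde B(M - \lambda I_1)^{-1}\tilde B^T\right)y = y^TSy.$$

Since now $\lambda < 0$, it follows that $0 < (M - \lambda I_1)^{-1} < M^{-1}$, that is

$$\lambda_{\max}(S) \ge -\lambda \ge \frac{y^TSy}{y^T(I_2 + \tilde BM^{-1}\tilde B^T)y}. \tag{3}$$

Here

$$\frac{y^T\tilde BM^{-1}\tilde B^Ty}{y^Ty} = \frac{y^T\tilde BM^{-1}\tilde B^Ty}{y^T\tilde B\tilde B^Ty} \cdot \frac{y^T\tilde B\tilde B^Ty}{y^TSy} \cdot \frac{y^TSy}{y^Ty} \le \frac{\gamma^2}{\mu_1}\,\frac{y^TSy}{y^Ty},$$

where $\gamma^2 = \rho(S^{-1/2}BM^{-1}B^TS^{-1/2})$.

Therefore, by (3)

$$-\lambda \ge \frac{\lambda_{\min}(S)}{1 + \frac{\gamma^2}{\mu_1}\lambda_{\min}(S)}.$$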

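This proves part (a). For parts (b) and (c) we use (1), which for $\lambda > 0$ reduces to

$$(\lambda M^{-1} - I_1)\tilde x - \tilde B^T(\lambda I_2 + C)^{-1}\tilde B\tilde x = 0$$

or

$$\lambda^2\,\tilde x^TM^{-1}\tilde x - \lambda\,\tilde x^T\tilde x - (\tilde B\tilde x)^T\left(I_2 + \tfrac{1}{\lambda}C\right)^{-1}\tilde B\tilde x = 0. \tag{4}$$

Let

$$a = \frac{\tilde x^T\tilde x}{\tilde x^TM^{-1}\tilde x}, \qquad b = \frac{(\tilde B\tilde x)^T\left(I_2 + \frac{1}{\lambda}C\right)^{-1}\tilde B\tilde x}{\tilde x^T\tilde x}.$$

Since $\lambda > 0$, it holds that $0 < (I_2 + \frac{1}{\lambda}C)^{-1} \le I_2$, and it follows that

$$0 \le b \le \rho(\tilde B^T\tilde B) = \rho(\tilde B\tilde B^T) = \rho(BM^{-1}B^T) = \sigma_m.$$

For $C = 0$, we have

$$\sigma_1 \le b \le \sigma_m.$$

It holds

$$\lambda^2 - \lambda a - ba = 0 \tag{5}$$

or

$$\lambda = \tfrac{1}{2}a\left(1 + \sqrt{1 + 4b/a}\right).$$

Therefore, since $\mu_1 \le a \le \mu_n$,

$$\mu_1\frac{1 + \sqrt{1 + 4\tilde\sigma_1/\mu_1}}{2} \le \lambda \le \mu_n\frac{1 + \sqrt{1 + 4\sigma_m/\mu_n}}{2},$$

where $\tilde\sigma_1 = \sigma_1$ if $C = 0$, otherwise $\tilde\sigma_1 = 0$.

It only remains to prove the bounds for the negative eigenvalues for the case $C = 0$.

It follows from (4) and (5),

$$\lambda^2 - \lambda a - ba = 0,$$

where $b = \|\tilde B\tilde x\|^2 / \|\tilde x\|^2$.

For $\lambda < 0$ we get the solution

$$\lambda = \frac{-b}{\frac{1}{2}\left(1 + \sqrt{1 + 4b/a}\right)},$$

which, since $b > 0$, shows the stated bounds for $C = 0$.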

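Corollary 1. The eigenvalues of the block-diagonal preconditioned matrix

$$\begin{bmatrix} M^{-1} & 0 \\ 0 & S^{-1} \end{bmatrix} A$$

are contained in the intervals $[-1, -1/(1 + \tilde\sigma_m)] \cup [1, 1 + \tilde\sigma_m]$,
where $\tilde\sigma_m = \gamma^2 = \rho(S^{-1/2}BM^{-1}B^TS^{-1/2})$. Here $\tilde\sigma_m \le 1$ if $C$ is positive semidefinite.

Proof. Replace $S$ with $I_2$ and $M$ with $I_1$ in Theorem 1.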
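The intervals of Corollary 1 can be illustrated numerically. The sketch below uses random test data and the symmetrically scaled matrix $\operatorname{diag}(M^{-1/2}, S^{-1/2})\,A\,\operatorname{diag}(M^{-1/2}, S^{-1/2})$, which is similar to the preconditioned matrix of the corollary; the data, the helper `inv_sqrt` and the tolerance are assumptions introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 4
G = rng.standard_normal((n, n))
M = G @ G.T + n * np.eye(n)          # SPD top block (hypothetical data)
B = rng.standard_normal((m, n))
H = rng.standard_normal((m, m))
C = H @ H.T                          # symmetric positive semidefinite

def inv_sqrt(X):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(X)
    return V @ np.diag(w ** -0.5) @ V.T

S = C + B @ np.linalg.solve(M, B.T)
M_ih, S_ih = inv_sqrt(M), inv_sqrt(S)

A = np.block([[M, B.T], [B, -C]])
P = np.block([[M_ih, np.zeros((n, m))], [np.zeros((m, n)), S_ih]])
A_prec = P @ A @ P                   # similar to diag(M^{-1}, S^{-1}) A
A_prec = 0.5 * (A_prec + A_prec.T)   # remove rounding asymmetry

eigs = np.linalg.eigvalsh(A_prec)
gamma2 = np.linalg.eigvalsh(S_ih @ B @ np.linalg.solve(M, B.T) @ S_ih).max()

tol = 1e-10
neg, pos = eigs[eigs < 0], eigs[eigs > 0]
neg_ok = bool(np.all(neg >= -1 - tol) and np.all(neg <= -1 / (1 + gamma2) + tol))
pos_ok = bool(np.all(pos >= 1 - tol) and np.all(pos <= 1 + gamma2 + tol))
print(gamma2, neg_ok, pos_ok)
```

Since $C$ is positive semidefinite in this example, the computed $\gamma^2$ does not exceed one.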

Remark 1. The above bounds are more accurate than previously given bounds 
in [8] and [9]. In addition, they hold even if C is not positive semi-definite. 

An important observation from Theorem 1 is that the negative eigenvalues
depend essentially only on $S$ while the positive eigenvalues depend essentially
only on $M$. If $M$ is well-conditioned, the eigenvalue bounds are most sensitive
to the values of $\lambda_{\min}(S)$ and $\sigma_m$. Frequently $\sigma_m$ ($= \rho(BM^{-1}B^T)$) is bounded
by a not very big number but $\lambda_{\min}(S)$ can take small values if $B$ is nearly
rank-deficient and $C$ does not compensate for this.



2.1 Improvement of Eigenvalue Bounds by Congruence Transformation



We show now how the eigenvalue bounds of a given matrix on saddle point
form can be improved by a congruence transformation $ZAZ^T$ for a proper block
triangular matrix $Z$. Since, by Sylvester's theorem, a congruence transformation
preserves the signs of the eigenvalues, as is seen from Corollary 1, the aim is here
to cluster the eigenvalues around the points $-1$ and $+1$.

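Let then $H_1$ be an approximate inverse of $M$ and let

$$Z = \begin{bmatrix} I_1 & 0 \\ -BH_1 & I_2 \end{bmatrix}.$$

Then a computation shows that

$$\tilde A = ZAZ^T = \begin{bmatrix} M & (I_1 - MH_1^T)B^T \\ B(I_1 - H_1M) & -S + B(I_1 - H_1M)M^{-1}(I_1 - MH_1^T)B^T \end{bmatrix}. \tag{6}$$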

Here $-S$, where $S = C + BM^{-1}B^T$, is the Schur complement of $A$. It is readily
seen that this also equals the Schur complement of $\tilde A$. Therefore, applying Theorem 1
for $\tilde A$, with $B$ replaced by $B(I_1 - H_1M)$, shows that the eigenvalues of $\tilde A$
are contained in the intervals

$$\left[ -\lambda_{\max}(S),\ \frac{-\lambda_{\min}(S)}{1 + \lambda_{\min}(S)\tilde\gamma^2/\mu_1} \right] \cup \left[ \mu_1,\ \mu_n\frac{1 + \sqrt{1 + 4\tilde\sigma_m/\mu_n}}{2} \right],$$

where $\tilde\gamma^2 = O(\|I_1 - MH_1\|^2)$, $\tilde\sigma_m = O(\|I_1 - MH_1\|^2)$. When $H_1$ is a sufficiently
accurate preconditioner to $M^{-1}$ it follows that the eigenvalues are contained
approximately in the intervals

$$[-\lambda_{\max}(S),\ -\lambda_{\min}(S)] \cup [\mu_1,\ \mu_n],$$

where the perturbations are of second order.
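The block formula (6) and the second-order effect can be observed in a small experiment. The following NumPy sketch perturbs the exact inverse, $H_1 = M^{-1} + \varepsilon E$, and reports the size of the off-diagonal block and of the deviation of the lower right block from $-S$; the test matrices, the perturbation $E$ and the function name `transform` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 8, 3
G = rng.standard_normal((n, n))
M = G @ G.T + n * np.eye(n)          # SPD top block (hypothetical data)
B = rng.standard_normal((m, n))
C = np.zeros((m, m))
A = np.block([[M, B.T], [B, -C]])
M_inv = np.linalg.inv(M)
S = C + B @ M_inv @ B.T

E = rng.standard_normal((n, n))
E = 0.5 * (E + E.T)
E *= np.linalg.norm(M_inv, 2) / np.linalg.norm(E, 2)   # same scale as M^{-1}

def transform(eps):
    """Apply Z A Z^T with H1 = M^{-1} + eps*E and compare with formula (6)."""
    H1 = M_inv + eps * E
    Z = np.block([[np.eye(n), np.zeros((n, m))], [-B @ H1, np.eye(m)]])
    A_t = Z @ A @ Z.T
    R = np.eye(n) - M @ H1.T                       # I_1 - M H_1^T
    expected_22 = -S + B @ R.T @ M_inv @ R @ B.T   # lower right block of (6)
    formula_error = (np.linalg.norm(A_t[:n, n:] - R @ B.T)
                     + np.linalg.norm(A_t[n:, n:] - expected_22))
    off_diagonal = np.linalg.norm(A_t[:n, n:], 2)
    correction = np.linalg.norm(A_t[n:, n:] + S, 2)
    return off_diagonal, correction, formula_error

eps_values = (1e-1, 1e-2, 1e-3)
rows = [transform(eps) for eps in eps_values]
for eps, (off, corr, err) in zip(eps_values, rows):
    print(f"eps={eps:.0e}  off-diagonal={off:.3e}  (2,2)+S={corr:.3e}  formula error={err:.1e}")
```

Reducing $\varepsilon$ by a factor of ten reduces the off-diagonal block by about ten and the correction to $-S$ by about one hundred, in line with the second-order statement.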

Applying now proper symmetric and positive definite preconditioners $D_1$ of $M$
and $D_2$ of $S$ will shrink the intervals to (approximately)

$$[-\lambda_{\max}(\tilde S),\ -\lambda_{\min}(\tilde S)] \cup [\tilde\mu_1,\ \tilde\mu_n],$$

where $\tilde\mu_1$, $\tilde\mu_n$ are the extreme eigenvalues of $\tilde M = D_1^{-1}M$ and where $\tilde S = D_2^{-1}S$.

The matrix $H_1$ can be a sparse matrix, possibly just diagonal, if it suffices to
make $\|I_1 - MH_1\|$ small. We can also apply a more accurate preconditioner, such
as $H_1 = D_1^{-1}$. The most sensitive part in the preconditioning is determined by
$\lambda_{\min}(S)$, which is small if $B$ is nearly rank deficient and the matrix $C$ does not
stabilize, i.e. compensate for this defect. See further Section 4 for comments on
this.

Block-Diagonal Preconditioner; Computational Complexity

To compute the action

$$\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \mathcal{D}^{-1}\tilde A \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$

of the preconditioned matrix $\mathcal{D}^{-1}\tilde A$, where $\mathcal{D} = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}$, on a vector $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$, two actions of $H_1$ and one of $D_1^{-1}$ and $D_2^{-1}$ take
place, in addition to some matrix vector multiplications and vector additions.

For the choice $H_1 = D_1^{-1}$ it suffices with two actions of $D_1^{-1}$ and one of $D_2^{-1}$ if
the following algorithm is used:

(i) Solve $D_1v_1 = B^Ty_2$

(ii) Solve $D_1w_1 = M(y_1 - v_1)$

(iii) Compute $z_1 = v_1 + w_1$

(iv) Compute $w_2 = B(y_1 - v_1 - z_1) - Cy_2$

(v) Solve $D_2z_2 = w_2$
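Steps (i) to (v) can be sketched as a short routine. The version below is only an illustration under stated assumptions: the function name `apply_preconditioned_operator`, the diagonal choices $D_1 = \operatorname{diag}(M)$ and $D_2 = \operatorname{diag}(S)$, and the random test data are introduced here and are not prescribed by the paper. The result is compared with the explicitly formed product $\mathcal{D}^{-1}ZAZ^T$ for $H_1 = D_1^{-1}$.

```python
import numpy as np

def apply_preconditioned_operator(M, B, C, solve_D1, solve_D2, y1, y2):
    """Steps (i)-(v): return (z1, z2) = D^{-1} Z A Z^T (y1, y2) for H1 = D1^{-1}."""
    v1 = solve_D1(B.T @ y2)               # (i)   D1 v1 = B^T y2
    w1 = solve_D1(M @ (y1 - v1))          # (ii)  D1 w1 = M (y1 - v1)
    z1 = v1 + w1                          # (iii)
    w2 = B @ (y1 - v1 - z1) - C @ y2      # (iv)
    z2 = solve_D2(w2)                     # (v)   D2 z2 = w2
    return z1, z2

# Hypothetical test problem with diagonal (Jacobi-type) preconditioners.
rng = np.random.default_rng(3)
n, m = 7, 3
G = rng.standard_normal((n, n))
M = G @ G.T + n * np.eye(n)
B = rng.standard_normal((m, n))
C = 0.1 * np.eye(m)
S = C + B @ np.linalg.solve(M, B.T)
d1 = np.diag(M).copy()                    # D1 = diag(M)
d2 = np.diag(S).copy()                    # D2 = diag(S)

y1, y2 = rng.standard_normal(n), rng.standard_normal(m)
z1, z2 = apply_preconditioned_operator(
    M, B, C, lambda r: r / d1, lambda r: r / d2, y1, y2)

# Reference: form Z, A and the block-diagonal preconditioner explicitly.
H1 = np.diag(1.0 / d1)
A = np.block([[M, B.T], [B, -C]])
Z = np.block([[np.eye(n), np.zeros((n, m))], [-B @ H1, np.eye(m)]])
reference = (Z @ A @ Z.T @ np.concatenate([y1, y2])) / np.concatenate([d1, d2])
error = np.linalg.norm(np.concatenate([z1, z2]) - reference)
print(error)
```

The routine calls `solve_D1` twice and `solve_D2` once, matching the operation count stated above; the transformed matrix is never formed.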



2.2 Iteration Number Bounds 



The preconditioned saddle-point system can be solved by a generalized conjugate
gradient-minimal residual method (see e.g. [2]), which is based on the best
approximations in the Krylov subspace

$$\{\tilde r^0,\ \mathcal{B}\tilde r^0,\ \mathcal{B}^2\tilde r^0,\ \dots\},$$

where $\mathcal{B} = \mathcal{D}^{-1}\tilde A$, $\mathcal{D}$ is the preconditioner and $\tilde r^0 = \mathcal{D}^{-1}(\tilde Ax^0 - a)$, for a given
initial vector $x^0$. For symmetric matrices, the rate of convergence of such methods
can be estimated by a best polynomial approximation, namely for some norm
$\|\cdot\|_W$, it holds

$$\frac{\|\tilde r^k\|_W}{\|\tilde r^0\|_W} \le \min_{P_k \in \Pi_k^0}\ \max_{\lambda_i} |P_k(\lambda_i)|,$$

where $\{\lambda_i\}$ is the set of eigenvalues of $\mathcal{B}$ and $\Pi_k^0$ denotes the set of polynomials
of degree $k$ normalized at the origin.

For matrices with eigenvalues located symmetrically about the origin in intervals
of equal length, it follows that the best polynomial approximation takes
the form

$$\min \max |P_k(\lambda_i^2)|,$$

which corresponds to a condition number that is the square of the
condition number for a single interval. For non-equal length intervals better
bounds can be derived, but the above shows that it is important to precondition
both $S$ and $M$ so that the condition number of each interval is not too large.

It also shows that there are good reasons to look for preconditioning methods
which transform the eigenvalues to the positive half-plane, which we now
consider.

3 Clustering the Eigenvalues around the Unit Number in the Complex Plane

Instead of using a preconditioner which clusters the eigenvalues around the points
$-1$, $+1$ on the real axis, it can be more efficient to precondition the matrix to
cluster the eigenvalues around the unit number in the complex plane.

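Applying the congruence transformation in (6), we change then the sign in
the lower block part of the matrix and precondition it with a block diagonal
matrix, $\begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}$, where $D_i$, $i = 1,2$ are symmetric and positive definite. The
corresponding eigenvalue problem takes then the form

$$\begin{bmatrix} M & (I_1 - MH_1^T)B^T \\ -B(I_1 - H_1M) & S - B(I_1 - H_1M)M^{-1}(I_1 - MH_1^T)B^T \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \lambda \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix},$$

which can be rewritten in the form

$$\begin{bmatrix} \tilde M & \tilde B^T \\ -\tilde B & \tilde S - \tilde B\tilde M^{-1}\tilde B^T \end{bmatrix} \begin{bmatrix} \tilde x \\ \tilde y \end{bmatrix} = \lambda \begin{bmatrix} \tilde x \\ \tilde y \end{bmatrix}, \tag{7}$$

where $\tilde M = D_1^{-1/2}MD_1^{-1/2}$, $\tilde B = D_2^{-1/2}B(I_1 - H_1M)D_1^{-1/2}$, $\tilde S = D_2^{-1/2}SD_2^{-1/2}$, $\tilde x = D_1^{1/2}x$, $\tilde y = D_2^{1/2}y$.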

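To find the relative perturbations of the eigenvalues from those of $\tilde M$ and $\tilde S$, the
matrix in (7) is rewritten in the form

$$\begin{bmatrix} \tilde M^{1/2} & 0 \\ 0 & \tilde S^{1/2} \end{bmatrix} \begin{bmatrix} I_1 & \hat B^T \\ -\hat B & I_2 - \hat B\hat B^T \end{bmatrix} \begin{bmatrix} \tilde M^{1/2} & 0 \\ 0 & \tilde S^{1/2} \end{bmatrix}, \tag{8}$$

where $\hat B = \tilde S^{-1/2}\tilde B\tilde M^{-1/2}$. Here (8) shows that there is a perturbation of the real
part with $O(\|\hat B\|^2)$ and of the imaginary part, arising from the skew symmetric
part of the matrix, of $O(\|\hat B\|)$.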

It can be seen that if the eigenvalues are contained in an ellipse symmetrically
located on the positive real axis, with eccentricity $\delta$ (that is, ratio of the semi-axes),
then the convergence factor which holds for a real interval is multiplied
by a factor depending on $\delta$ (see e.g. [2]). The importance of having a narrow ellipse,
$\delta < \sqrt{a/b}$, where $(a, 0)$, $(b, 0)$, $0 < a < b$, are the foci of the ellipse, is seen.

For applications in partial differential equation problems, where $M$ is nonsymmetric
(such as arising from a convection-diffusion problem), using discretization
methods which lead to matrices $M$ which are nearly M-matrices enables
the construction of accurate preconditioners to $M$.

We show now how such preconditioners cluster the eigenvalues near
$1 \cup \{\lambda_i(S)\}$. The method is applicable also for nonsymmetric matrices.

3.1 A General Preconditioning Technique

Consider a nonsingular and in general nonsymmetric matrix, partitioned in
two-by-two block form,



$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},$$

where $A_{ij}$ has order $n_i \times n_j$. We assume that $A$ is diagonalizable, i.e., has a
complete eigenvector space and that $A_{11}$ is nonsingular. The most important
application we have in mind is when $A$ is a matrix of regularized saddle point
form, in which case $A_{22}$ is zero or negative semidefinite.

When the given matrix $A$ is indefinite, we may transform it by multiplying
the matrix rows for the bottom blocks by $(-1)$. In this case the form is typically

$$\begin{bmatrix} M & B^T \\ -B & C \end{bmatrix},$$

where now the block matrix consists of a block diagonal part and a
skew-symmetric off-diagonal part. The matrix $M$ may be nonsymmetric.

We shall present a general block incomplete factorization method for $A$ which
enables clustering of eigenvalues of the preconditioned matrix around the unit
number or around $+1$ and $-1$, depending on the signs of certain matrices.

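Let then $E_i$, $F_i$, $i = 1,2$ be nonsingular matrices of orders consistent with
this partitioning and let

$$\tilde A_{ij} = E_i^{-1}A_{ij}F_j^{-1},\ i,j = 1,2, \qquad D_i = E_iF_i,\ i = 1,2.$$

Let further

$$Z_1 = \begin{bmatrix} E_1^{-1} & 0 \\ -\tilde A_{21}E_1^{-1} & E_2^{-1} \end{bmatrix} = \begin{bmatrix} E_1^{-1} & 0 \\ -E_2^{-1}A_{21}D_1^{-1} & E_2^{-1} \end{bmatrix}, \qquad Z_2 = \begin{bmatrix} F_1^{-1} & -F_1^{-1}\tilde A_{12} \\ 0 & F_2^{-1} \end{bmatrix} = \begin{bmatrix} F_1^{-1} & -D_1^{-1}A_{12}F_2^{-1} \\ 0 & F_2^{-1} \end{bmatrix}. \tag{9}$$

Here $D_1$ is a preconditioner to $A_{11}$, for instance an incomplete factorization, and
$D_2$ is a preconditioner to the Schur complement matrix $A_{22} - A_{21}D_1^{-1}A_{12}$ (see
below for an explanation of these choices). $Z_2Z_1$ will be used as a preconditioner
to $A$ either as a left preconditioner or in the form $Z_1AZ_2$, which latter matrix
is similar to $Z_2Z_1A$.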

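A computation shows that

$$\tilde A = Z_1AZ_2 = \begin{bmatrix} \tilde A_{11} & (I_1 - \tilde A_{11})\tilde A_{12} \\ \tilde A_{21}(I_1 - \tilde A_{11}) & \tilde A_{22} - \tilde A_{21}(2I_1 - \tilde A_{11})\tilde A_{12} \end{bmatrix}. \tag{10}$$

The matrix $\tilde A$ can be rewritten in the form

$$\tilde A = \begin{bmatrix} I_1 & 0 \\ 0 & \tilde S \end{bmatrix} + \begin{bmatrix} I_1 & 0 \\ -\tilde A_{21} & I_2 \end{bmatrix} \begin{bmatrix} \tilde A_{11} - I_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} I_1 & -\tilde A_{12} \\ 0 & I_2 \end{bmatrix}, \tag{11}$$

where $\tilde S = \tilde A_{22} - \tilde A_{21}\tilde A_{12} = E_2^{-1}(A_{22} - A_{21}D_1^{-1}A_{12})F_2^{-1}$. It follows from (10) that
in order for $Z_2Z_1$ to be an accurate preconditioner of $A$, $D_1$ must be an accurate
preconditioner of $A_{11}$ and $E_2F_2$ an accurate preconditioner of $A_{22} - A_{21}D_1^{-1}A_{12}$.
The accuracies of these approximations will be made precise in the next theorem.
We need then the following (elementary) result.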



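Lemma 1. It holds

$$\left\| \begin{bmatrix} I_1 & 0 \\ \tilde A_{21} & I_2 \end{bmatrix} \right\| = \sqrt{f(\|\tilde A_{21}\|)},$$

where $f(x) = 1 + \frac{1}{2}x^2 + \sqrt{x^2 + \frac{1}{4}x^4}$.

Proof. Let $\hat B = \begin{bmatrix} I_1 & 0 \\ \tilde A_{21} & I_2 \end{bmatrix}$. There holds $\|\hat B\|^2 = \rho(\hat B^T\hat B)$ where

$$\hat B^T\hat B = \begin{bmatrix} I_1 + \tilde A_{21}^T\tilde A_{21} & \tilde A_{21}^T \\ \tilde A_{21} & I_2 \end{bmatrix}$$

and a computation shows that $\rho(\hat B^T\hat B) = f(\|\tilde A_{21}\|)$.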



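Theorem 2. Let $A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$ be preconditioned by $Z_1, Z_2$ in (9), where
$D_1 = E_1F_1$ is a nonsingular preconditioner to $A_{11}$ and $D_2 = E_2F_2$ is a nonsingular
preconditioner to $A_{22} - A_{21}D_1^{-1}A_{12}$. Then, for a sufficiently accurate preconditioner
$D_1$, the eigenvalues of the block preconditioned matrix $\tilde A = Z_1AZ_2$
cluster around unity and around the eigenvalues of $\tilde S = \tilde A_{22} - \tilde A_{21}\tilde A_{12}$, where
$\tilde A_{ij} = E_i^{-1}A_{ij}F_j^{-1}$. Namely, by a proper ordering of the eigenvalues $\lambda_i(\tilde A)$, it holds

$$\max_{i=1,\dots,n_1} |\lambda_i(\tilde A) - 1| \le \delta, \qquad \max_{i=n_1+1,\dots,n_1+n_2} |\lambda_i(\tilde A) - \lambda_i(\tilde S)| \le \delta,$$

where

$$\delta = \left\| \tilde A - \begin{bmatrix} I_1 & 0 \\ 0 & \tilde S \end{bmatrix} \right\| \le \left\{ f(\|\tilde A_{21}\|)\, f(\|\tilde A_{12}\|) \right\}^{1/2} \|\tilde A_{11} - I_1\|$$

and $f(x) = 1 + \frac{1}{2}x^2 + \sqrt{x^2 + \frac{1}{4}x^4}$.

Proof. Take norms in the second term of (11) and use Lemma 1.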

The theorem shows that we can control the clustering of the eigenvalues to be
sufficiently close to unity and to the eigenvalues for $\tilde S$ by applying a sufficiently
accurate preconditioner to $A_{11}$. These two approximations can be controlled
essentially independently of each other; the only dependence is that $\tilde S$ involves
the inverse of the first preconditioner.
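The algebraic identities behind Theorem 2 can be verified on a small random example. The sketch below builds $Z_1$ and $Z_2$ as in (9), compares $Z_1AZ_2$ with the block expressions (10) and (11), and checks the norm formula of Lemma 1 in the spectral norm. The factors `E1`, `F1`, `E2`, `F2` and the test blocks are hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2 = 5, 3
I1, I2 = np.eye(n1), np.eye(n2)
A11 = 10 * I1 + rng.standard_normal((n1, n1))     # nonsymmetric, nonsingular
A12 = rng.standard_normal((n1, n2))
A21 = rng.standard_normal((n2, n1))
A22 = rng.standard_normal((n2, n2))
A = np.block([[A11, A12], [A21, A22]])

# Hypothetical nonsingular factors; D1 = E1 F1 only roughly approximates A11.
E1 = I1 + 0.05 * rng.standard_normal((n1, n1))
F1 = np.linalg.solve(E1, A11 + 0.1 * rng.standard_normal((n1, n1)))
E2 = I2 + 0.05 * rng.standard_normal((n2, n2))
F2 = I2 + 0.05 * rng.standard_normal((n2, n2))
inv = np.linalg.inv

def scaled(Aij, Ei, Fj):
    """Scaled block E_i^{-1} A_ij F_j^{-1}."""
    return inv(Ei) @ Aij @ inv(Fj)

T11, T12 = scaled(A11, E1, F1), scaled(A12, E1, F2)
T21, T22 = scaled(A21, E2, F1), scaled(A22, E2, F2)
Z12, Z21 = np.zeros((n1, n2)), np.zeros((n2, n1))

Z1 = np.block([[inv(E1), Z12], [-T21 @ inv(E1), inv(E2)]])
Z2 = np.block([[inv(F1), -inv(F1) @ T12], [Z21, inv(F2)]])
A_t = Z1 @ A @ Z2

# Right-hand side of (10)
rhs_10 = np.block([[T11, (I1 - T11) @ T12],
                   [T21 @ (I1 - T11), T22 - T21 @ (2 * I1 - T11) @ T12]])

# Right-hand side of (11)
S_t = T22 - T21 @ T12
lower = np.block([[I1, Z12], [-T21, I2]])
upper = np.block([[I1, -T12], [Z21, I2]])
middle = np.block([[T11 - I1, Z12], [Z21, np.zeros((n2, n2))]])
rhs_11 = np.block([[I1, Z12], [Z21, S_t]]) + lower @ middle @ upper

def f(x):
    return 1 + 0.5 * x**2 + np.sqrt(x**2 + 0.25 * x**4)

err_10 = np.linalg.norm(A_t - rhs_10)
err_11 = np.linalg.norm(A_t - rhs_11)
err_lemma = abs(np.linalg.norm(lower, 2) - np.sqrt(f(np.linalg.norm(T21, 2))))
print(err_10, err_11, err_lemma)
```

The identities hold for any nonsingular factors; the quality of $D_1 = E_1F_1$ only affects the size of $\|\tilde A_{11} - I_1\|$ and hence of the perturbation term in (11).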

Note that for saddle point matrices, $\tilde S$ may have eigenvalues with positive or
negative real parts, depending on whether the initial transformation to the form
$\begin{bmatrix} M & B^T \\ -B & C \end{bmatrix}$ has taken place or not.

3.2 One-Sided versus Two-Sided Preconditioners

When dealing with iterative methods for nonsymmetric matrices an additional
complication arises in that the matrix may not be normal, i.e., not symmetrizable.
This means appearance of Jordan blocks, which can delay the rate of convergence
substantially, or cause convergence stagnation. We discuss this issue in
connection with one-sided ($Z_1A$) versus two-sided ($Z_1AZ_2$) preconditioners.

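Consider then the factorization of a nonsingular matrix

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \quad \text{i.e.} \quad A = \begin{bmatrix} A_{11} & 0 \\ A_{21} & S_2 \end{bmatrix} \begin{bmatrix} I_1 & A_{11}^{-1}A_{12} \\ 0 & I_2 \end{bmatrix}.$$

Let $D_1$ be a preconditioner to $A_{11}$ and $D_2$ to $S_2 = A_{22} - A_{21}A_{11}^{-1}A_{12}$. Then
for a left-sided, i.e. block Gauss-Seidel preconditioner

$$\mathcal{B} = \begin{bmatrix} D_1 & 0 \\ A_{21} & D_2 \end{bmatrix}$$

it holds

$$\mathcal{B}^{-1}A = \begin{bmatrix} D_1^{-1} & 0 \\ -D_2^{-1}A_{21}D_1^{-1} & D_2^{-1} \end{bmatrix} \begin{bmatrix} A_{11} & 0 \\ A_{21} & S_2 \end{bmatrix} \begin{bmatrix} I_1 & A_{11}^{-1}A_{12} \\ 0 & I_2 \end{bmatrix} = \begin{bmatrix} D_1^{-1}A_{11} & 0 \\ D_2^{-1}A_{21}(I_1 - D_1^{-1}A_{11}) & D_2^{-1}S_2 \end{bmatrix} \begin{bmatrix} I_1 & A_{11}^{-1}A_{12} \\ 0 & I_2 \end{bmatrix}. \tag{12}$$

Now, even if $D_1$ is an accurate preconditioner to $A_{11}$ or, even if $D_1 = A_{11}$,
the presence of the upper off-diagonal matrix $A_{11}^{-1}A_{12}$ in the second factor in
(12) may cause a delay or stagnation in the convergence of an iterative solution
method.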



If $D_1 = A_{11}$ and $D_2 = S_2$, since then

$$\left( \begin{bmatrix} I_1 & 0 \\ 0 & I_2 \end{bmatrix} - \mathcal{B}^{-1}A \right)^2 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix},$$

it follows that the minimal polynomial to $\mathcal{B}^{-1}A$ is just $(1 - \lambda)^2$, i.e. has degree 2, and the
convergence bound in section 2.2 shows that the GCG-MR method converges in
just two iterations. However, in practice we have only approximations of $A_{11}$
and $S_2$ and the degree of the minimal polynomial is equal or close to the
order of the whole system rather than 2. In addition, the norms of the matrix
powers $\begin{bmatrix} I_1 & A_{11}^{-1}A_{12} \\ 0 & I_2 \end{bmatrix}^k$, $k = 0, 1, \dots$, which arise in a Krylov method, can grow
rapidly with increasing number ($k$) of iterations, causing perturbations and loss
of orthogonality among the direction vectors in the minimal residual method.
Numerical tests show also a strong dependence on the preconditioning accuracy,
in particular of $D_1$ to $M$.
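Both observations, the degree-two minimal polynomial for exact blocks and the growth of the powers of the unit upper triangular factor, can be reproduced on a small example. The following NumPy sketch uses randomly generated blocks; the data and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2 = 5, 3
A11 = 10 * np.eye(n1) + rng.standard_normal((n1, n1))
A12 = rng.standard_normal((n1, n2))
A21 = rng.standard_normal((n2, n1))
A22 = -8 * np.eye(n2) + 0.1 * rng.standard_normal((n2, n2))
A = np.block([[A11, A12], [A21, A22]])

X = np.linalg.solve(A11, A12)                 # A11^{-1} A12
S2 = A22 - A21 @ X                            # exact Schur complement

# Exact block Gauss-Seidel preconditioner (D1 = A11, D2 = S2).
B_gs = np.block([[A11, np.zeros((n1, n2))], [A21, S2]])
T = np.linalg.solve(B_gs, A)                  # one-sided preconditioned matrix
N = np.eye(n1 + n2) - T

norm_N = np.linalg.norm(N)                    # nonzero: T is not the identity
norm_N_squared = np.linalg.norm(N @ N)        # near zero: (I - T)^2 = 0

# Powers of the unit upper triangular factor: U^k = [[I, k X], [0, I]].
U = np.block([[np.eye(n1), X], [np.zeros((n2, n1)), np.eye(n2)]])
growth = [np.linalg.norm(np.linalg.matrix_power(U, k), 2) for k in (1, 10, 100)]
print(norm_N, norm_N_squared, growth)
```

The preconditioned matrix differs from the identity, yet the square of the difference vanishes up to rounding, while the norm of the $k$-th power of the triangular factor increases with $k$.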

For reasons of robustness of an iterative method, it is therefore advisable to
use a two-sided preconditioner, where the powers of the corresponding factors
grow little. The one-sided preconditioner still involves actions of matrices $D_1^{-1}$
and $D_2^{-1}$, which is normally the computationally most expensive part in each
iteration step, and little is saved compared to using a two-sided preconditioner,
while the latter normally gives much fewer iterations.

4 Regularization of (Nearly) Rank Deficient Problems

In finite element analysis the so-called Babuška-Brezzi (BB) condition, which
is an inf-sup condition, is commonly used to analyse the stability of constrained
boundary value problems.

This condition can be presented in algebraic form as follows. As we shall see, it
is related to the smallest eigenvalues of the Schur complement matrix. Let $B$ be
of order $m \times n$, where $m \le n$ and rank$(B) = m$. First recall that the
Moore-Penrose generalized inverse of $B$ equals $B^\dagger = B^T(BB^T)^{-1}$. The algebraic form
of the BB condition is



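$$\sigma = \inf_y \sup_x \frac{y^TBx}{\|y\|\,\|x\|} = \inf_y \frac{\|y\|}{\|B^\dagger y\|} = \frac{1}{\|B^\dagger\|}.$$

Here, given $y$ the sup is taken for $x = B^\dagger y$ (or its scalar multiple), for which
$Bx = y$.

We have

$$\|B^\dagger\|^2 = \rho\left((B^\dagger)^TB^\dagger\right) = \rho\left((BB^T)^{-1}\right) = 1/\lambda_{\min}(BB^T).$$

Hence, it follows that

$$\sigma = \lambda_{\min}^{1/2}(BB^T).$$

Similarly, we obtain for the $M$-norm of $x$, where $M$ is spd,

$$\sigma = \inf_y \sup_x \frac{y^TBx}{\|y\|\sqrt{x^TMx}} = \inf_y \sup_x \frac{y^TBM^{-1/2}(M^{1/2}x)}{\|y\|\,\|M^{1/2}x\|} = \inf_y \sup_{\tilde x} \frac{y^T\tilde B\tilde x}{\|y\|\,\|\tilde x\|} = \lambda_{\min}^{1/2}(BM^{-1}B^T).$$

Here $\sigma > 0$ if $B$ has full rank but $\sigma = 0$ if $B$ is rank deficient. If $B$ is (nearly)
rank deficient, one must regularize the problem, writing it in the form

$$\begin{bmatrix} M & B^T \\ B & -C \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} f \\ g - Cy \end{bmatrix}.$$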



Here the right-hand side term $Cy$ can sometimes be computed from the given
problem as for Stokes problems or, otherwise, it is treated in a defect-correction
way, i.e. with some initial approximation $y^{(0)}$, one computes a correction from

$$\begin{bmatrix} M & B^T \\ B & -C \end{bmatrix} \begin{bmatrix} x \\ y^{(1)} \end{bmatrix} = \begin{bmatrix} f \\ g - Cy^{(0)} \end{bmatrix},$$

which may be repeated a number of times using the updated solutions.

The matrix $C$ is symmetric and positive semidefinite and positive definite on
the nullspace of $B^T$.

For the corresponding Schur complement it holds then

$$\sigma = \lambda_{\min}(C + BM^{-1}B^T) > 0.$$
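A tiny example illustrates the effect of the regularization. In the sketch below the matrices are hypothetical and chosen so that the result can be read off by hand: $B$ has a zero row, so $BM^{-1}B^T$ is singular, and $C$ acts only on the nullspace of $B^T$.

```python
import numpy as np

n, m = 6, 3
M = np.diag(np.arange(1.0, n + 1))        # SPD, M = diag(1, ..., 6)
B = np.zeros((m, n))
B[0, 0] = 1.0
B[1, 1] = 1.0                             # third row is zero: rank(B) = 2 < m

S0 = B @ np.linalg.solve(M, B.T)          # B M^{-1} B^T = diag(1, 1/2, 0)
C = np.diag([0.0, 0.0, 0.5])              # PSD, positive definite on null(B^T)
S = C + S0                                # diag(1, 1/2, 1/2)

sigma_unregularized = np.linalg.eigvalsh(S0).min()
sigma_regularized = np.linalg.eigvalsh(S).min()
print(sigma_unregularized, sigma_regularized)
```

Without $C$ the smallest eigenvalue of the Schur complement is zero; with $C$ it becomes $1/2$, so the regularized Schur complement is positive definite.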

For the BB-condition to hold in finite element problems one must use stable
element pairs for the two variables (such as velocity and pressure). This means
that the space $V$ for $x$ (velocity) must be sufficiently richer than that ($W$) for $y$
(pressure). For the simplest piecewise linear finite elements one can add bubble
functions to $V$ to form $W$.

On the other hand, when finite element methods are implemented on parallel
computer platforms, the data communication is simplified if one uses equal order
finite elements for $V$ and $W$. To stabilize for this, or to increase the stabilization
even when one uses some stable element pairs, but where the constant $\sigma$ is too
small, one must use a stabilization term $-\alpha C$, for some proper scalar $\alpha$.

The stabilization term should be active for the functions which are missing in
an equal order finite element method compared to that for a BB-stable method.
As such functions are of bubble type, they correspond to oscillating functions
for which the choice $C = -\Delta$, the Laplacian operator, is active and gives large
perturbations ($O(h^{-2})$). On the other hand, the size of the coefficient $\alpha$ must be
sufficiently small in order not to perturb the discretization order of the method.
This indicates that $\alpha = O(h^2)$ is a proper choice.

This has been shown for the Stokes problem in [3]. Here, taking the divergence
of the first equation

$$-\Delta u + \nabla p = f$$

and using the incompressibility $\nabla \cdot u = 0$ results in the equation

$$\Delta p = \nabla \cdot f,$$

which, multiplied by a constant $\alpha = O(h^2)$, gives

$$\alpha\Delta p = \alpha\nabla \cdot f$$

and corresponds, with $C = -\Delta$, to the regularization term

$$-\alpha Cp = \alpha\nabla \cdot f.$$

Similar choices can hold for more general problems, see e.g. [6]. However, the 
form of the regularization term is problem dependent. 

5 Concluding Remarks 

It has been shown how saddle point problems can be solved efficiently by iterative
solution methods using particular preconditioners. Thereby it is important
to precondition both the top-block matrix and the Schur complement matrix
accurately, to achieve a method requiring few iterations. Stabilization is necessary
to increase the smallest eigenvalue of the Schur complement matrix, when
the constraint condition matrix is singular or nearly singular. In particular, it
enables the use of simpler (equal order) finite elements in certain incompressible
material problems (see [3]).

A way to make it possible to accurately precondition the top matrix block
when it arises from convection diffusion problems is to use a discretization
method for which the arising matrix ($M$) is an M-matrix. Here incomplete
factorization methods result in accurate factorizations if a node ordering based
on the local flow directions is used, see [5] for an example where block matrix
factorizations have been used.



References 

1. Axelsson, O.: Preconditioning of indefinite problems by regularization. SIAM Jour- 
nal on Numerical Analysis 16 (1979) 58-69. 

2. Axelsson, O.: A generalized conjugate gradient, least squares method. Numerische 
Mathematik 51 (1987) 209-227. 

3. Axelsson, O., Barker, V.A., Neytcheva, M., Polman, B.: Solving the Stokes problem
on a massively parallel computer. Mathematical Modelling and Analysis 4 (2000)
1-22.

4. Axelsson, O., Marinova, R.: A hybrid method of characteristics and central differ- 
ence method for convection-diffusion problems. Computational and Applied Math- 
ematics 21 (2002) 631-659. 

5. Axelsson, O., Eijkhout, V., Polman, B., Vassilevski, P.: Iterative solution of singular 
perturbation 2nd order boundary value problems by use of incomplete block-matrix 
factorization methods. BIT 29 (1989) 867-889. 

6. Axelsson, O., Neytcheva, M.: Preconditioning methods for linear systems arising in 
constrained optimization problems. Numerical Linear Algebra with Applications 
10 (2003) 3-31. 

7. Perugia, I., Simoncini, V.: Block-diagonal and indefinite symmetric preconditioners
for mixed finite element formulations. Preconditioning techniques for large
sparse matrix problems in industrial application. Numerical Linear Algebra with
Applications 7 (2000) 585-616.

8. Rusten, T., Winther, R.: A preconditioned iterative method for saddle point prob- 
lems. SIAM J. Matrix. Anal. Appl. 13 (1992) 887-904. 

9. Wathen, A., Silvester, D.: Fast Iterative Solution of Stabilized Stokes Systems Part 
II: Using General Block Preconditioners. SIAM Journal on Numerical Analysis 31 
(1994) 1352-1367. 



A 3D Projection Scheme 
for Incompressible Multiphase Flows Using 
Dynamic Front Refinement and Reconnection 



Tong Chen^, Peter Dimitrov Minev^, and Krishnaswamy Nandakumar^ 



^ Department of Chemical and Materials Engineering 
University of Alberta, Edmonton, Alberta, Canada T6G 2G6 
^ Department of Mathematical and Statistical Sciences 
University of Alberta, Edmonton, Alberta, Canada T6G 2G1 
{tc8,minev,kumar.nandakumar}@ualberta.ca



Abstract. This paper presents a 3D finite element method for incom- 
pressible multiphase flows with capillary interfaces based on a (formally) 
second-order projection scheme. The discretization is on a fixed reference 
(Eulerian) grid with an edge-based local $h$-refinement. The fluid phases
are identified and advected using a level set function. The reference grid 
is then temporarily reconnected around the interfaces in order to main- 
tain optimal interpolations accounting for the singularities. This method 
is simple and efficient, as demonstrated by the numerical examples on 
several free-surface problems. 



1 Introduction 

Multiphase flows, including the free-surface flows and bubbly flows, occur in a 
wide range of geophysical, chemical and industrial processes. Among them are 
oil transportation, mixing in chemical reactors, cooling of nuclear reactors, aer- 
ation processes, film coating, plastic molding and biomechanics. In general, it 
seems that it is more efficient to solve such problems in an Eulerian reference 
frame rather than in a Lagrangian or Eulerian-Lagrangian framework. There are 
numerous techniques based on the Eulerian approach and the reader is referred 
to the recent and comprehensive reviews by Scardovelli and Zaleski (1999) for 
the volume-of-fluid method, Osher and Fedkiw (2001) for the level-set method, 
and Tryggvason et al. (2001) for the front-tracking method. The difficulties 
with multiphase flow simulations are mostly due to the singularities (discon- 
tinuous pressure and normal derivatives of velocity) across the time-dependent 
free boundaries. At present, the most popular way to deal with the singulari- 
ties at the interface, particularly in the finite difference context, is an approach 
based on a regularization of the δ-function using trigonometric approximation. 
However, such methods are at best only first-order accurate in space. The finite 
element method (FEM) for free-boundary problems has also undergone rapid 
development. Minev et al. (2003) have presented a 3D finite element technique 
based on a dynamic basis enrichment for pressure and velocity in Taylor-Hood 
elements. In 2D, a simple but efficient projection scheme has been proposed by 
the authors (Chen et al., 2003). In the present paper, we extend this scheme to 
3D and demonstrate its capabilities on several difficult numerical examples. 

2 Mathematical Formulation 

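We consider a 3D flow domain $\Omega = \Omega_1 \cup \Omega_2$ that contains two different Newtonian fluids occupying $\Omega_1$ and $\Omega_2$, with constant densities $(\rho_1, \rho_2)$ and viscosities $(\mu_1, \mu_2)$. The fluids are assumed to be homogeneous, immiscible and incompressible, and separated by an interface $\Gamma = \partial\Omega_1 \cap \partial\Omega_2$. In each single-phase domain $\Omega_i$ $(i = 1, 2)$, the fluid motion is governed by the Navier-Stokes equations which, in a Cartesian coordinate system $\mathbf{x} = (x, y, z)$ with the $z$-axis pointing opposite to gravity, read (in a stress-divergence form) 

$$\rho_i\left[\partial_t \mathbf{u}_i + (\mathbf{u}_i \cdot \nabla)\mathbf{u}_i\right] = -\nabla P_i + \nabla \cdot \boldsymbol{\sigma}_i, \qquad \nabla \cdot \mathbf{u}_i = 0 \quad \text{in } \Omega_i, \tag{2.1}$$

where the hydrodynamic pressure $P_i = p_i + \rho_i g z$ is the total pressure $p_i$ minus the hydrostatic pressure $(-\rho_i g z)$, $\mathbf{u}_i$ are the velocities and $g$ is the gravity acceleration. $\boldsymbol{\sigma}_i = 2\mu_i \mathbf{e}[\mathbf{u}_i]$ is the deviatoric stress with $\mathbf{e}[\mathbf{u}_i] = \tfrac{1}{2}\left[\nabla\mathbf{u}_i + (\nabla\mathbf{u}_i)^T\right]$ being the rate-of-strain tensor. 

Without loss of generality, we assume Dirichlet boundary conditions on the external boundary $\partial\Omega$, $\mathbf{u}_i|_{\partial\Omega} = \mathbf{u}_b$. Based on the classical hypothesis that the surface tension is proportional to the mean curvature $\kappa$ of $\Gamma$, the balance of the total stress and the domain continuity results in the well-known interfacial conditions: 

$$[-p\mathbf{I} + \boldsymbol{\sigma}]\mathbf{n} = -\gamma\kappa\mathbf{n} \quad \text{and} \quad [\mathbf{u}] = 0 \quad \text{on } \Gamma. \tag{2.2a,b}$$

Here, $\mathbf{I}$ is the identity tensor, $[\cdot]$ represents the jump at $\Gamma$ between the limiting values from the two sides of $\Gamma$, $\mathbf{n} = \mathbf{n}_2 = -\mathbf{n}_1$ is the unit vector normal to $\Gamma$ pointing towards $\Omega_1$ from $\Omega_2$, and $\gamma$ is the coefficient of surface tension. The mean curvature is $\kappa = \nabla \cdot \mathbf{n} = (\mathbf{n} \times \nabla) \times \mathbf{n}$. 

The fluid properties $(\rho_i, \mu_i)$ are identified using a continuous indicator function $\phi$ that satisfies 

$$\frac{\partial \phi}{\partial t} + (\mathbf{u} \cdot \nabla)\phi = 0. \tag{2.5}$$

It is convenient though if the fluid interface $\Gamma$ corresponds to $\phi = 0$, so that 

$$\mathbf{x} \in \begin{cases} \Omega_1 & \text{if } \phi(\mathbf{x}) > 0, \\ \Gamma & \text{if } \phi(\mathbf{x}) = 0, \\ \Omega_2 & \text{if } \phi(\mathbf{x}) < 0. \end{cases} \tag{2.6}$$

The normal direction to $\Gamma$ is given by $\mathbf{n} = \nabla\phi/|\nabla\phi|$ at $\phi = 0$. 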
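As an illustration of how the indicator function separates the phases, the following minimal sketch classifies grid points by the sign of the level set function and evaluates the normal as the normalized gradient. The uniform grid, the function name `phase_and_normal` and the spherical test interface are assumptions made for this example only; the paper itself works on tetrahedral finite element grids.

```python
import numpy as np

def phase_and_normal(phi, spacing):
    """Classify points by the sign of phi and return n = grad(phi) / |grad(phi)|.

    Hypothetical helper on a uniform Cartesian grid (not the paper's FEM grid).
    """
    grad = np.stack(np.gradient(phi, spacing), axis=-1)   # finite-difference gradient
    norm = np.linalg.norm(grad, axis=-1, keepdims=True)
    normal = grad / np.maximum(norm, 1e-14)               # guard against |grad(phi)| = 0
    # Phase 1 where phi > 0, phase 2 elsewhere (points with phi = 0 lie on the interface).
    phase = np.where(phi > 0, 1, 2)
    return phase, normal

# Assumed test case: a sphere of radius 0.5 with phi > 0 inside.
x = np.linspace(-1.0, 1.0, 41)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
phi = 0.5 - np.sqrt(X**2 + Y**2 + Z**2)
phase, normal = phase_and_normal(phi, x[1] - x[0])
```

With this sign convention the normal points towards increasing values of the level set function, that is, into the region where it is positive.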

3 Discretization 

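With the time discretization, the Navier-Stokes equations in each of the single-phase domains $\Omega_i$ can be split into a pure convection problem and a generalized Stokes problem. In the present study, the former is solved using the Adams-Bashforth method and the latter is tackled with the second-order projection scheme of Kim and Moin (1985). Starting at time level $n \in [0, N]$ where $T = N\Delta t$, the linearized fractional substeps are summarized as 

$$\hat{\mathbf{u}}_i = \mathbf{u}_i^n - \Delta t\left[\tfrac{3}{2}(\mathbf{u}_i^n \cdot \nabla)\mathbf{u}_i^n - \tfrac{1}{2}(\mathbf{u}_i^{n-1} \cdot \nabla)\mathbf{u}_i^{n-1}\right], \tag{3.1a}$$

$$\rho_i \mathbf{u}_i^* = \rho_i \hat{\mathbf{u}}_i + \tfrac{1}{2}\Delta t\, \nabla \cdot (\boldsymbol{\sigma}_i^* + \boldsymbol{\sigma}_i^n), \qquad \mathbf{u}_i^*|_{\partial\Omega} = \mathbf{u}_b^{n+1} + \mathbf{w}_i^n|_{\partial\Omega}, \tag{3.1b}$$

$$\nabla \cdot (\rho_i^{-1}\nabla\Theta_i^{n+1}) = \nabla \cdot \mathbf{u}_i^*, \qquad \mathbf{n} \cdot \mathbf{w}_i^{n+1}|_{\partial\Omega} = \mathbf{n} \cdot \mathbf{w}_i^n|_{\partial\Omega}, \tag{3.1c}$$

$$\rho_i \mathbf{w}_i^{n+1} = \nabla\Theta_i^{n+1}, \qquad \mathbf{u}_i^{n+1} = \mathbf{u}_i^* - \mathbf{w}_i^{n+1}. \tag{3.1d}$$

Here, $\hat{\mathbf{u}}_i$ is the convected velocity. $\Theta_i$ is an auxiliary variable related to the pressure by 

$$P_i^{n+1} = \frac{\Theta_i^{n+1}}{\Delta t} - \tfrac{1}{2}\mu_i \nabla \cdot \mathbf{u}_i^*, \tag{3.2}$$

which is never explicitly computed. The free-surface boundary condition (2.2a) is split into $[P] = \gamma\kappa + [\rho]gz$ and $[\boldsymbol{\sigma}\mathbf{n}] = 0$ as discussed by Chen et al. (2003). 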
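The pure convection substep can be illustrated with a short sketch. It assumes the second-order Adams-Bashforth weights 3/2 and 1/2 and, purely for brevity, a 1D periodic field with central differences; the function names are hypothetical and the paper's actual discretization is a finite element one on the reference grid.

```python
import numpy as np

def convective_term(u, dx):
    """(u . grad) u for a 1D periodic field using central differences (illustrative only)."""
    dudx = (np.roll(u, -1) - np.roll(u, 1)) / (2.0 * dx)
    return u * dudx

def ab2_convected_velocity(u_n, u_nm1, dx, dt):
    """Convected velocity: u_hat = u^n - dt * [3/2 N(u^n) - 1/2 N(u^(n-1))]."""
    return u_n - dt * (1.5 * convective_term(u_n, dx) - 0.5 * convective_term(u_nm1, dx))

# Assumed example data: one period of a sine wave on 64 points.
x = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
dx = x[1] - x[0]
u_prev = np.sin(x)
u_curr = np.sin(x)
u_hat = ab2_convected_velocity(u_curr, u_prev, dx, dt=0.01)
```

When the two previous time levels coincide, the extrapolation reduces to a forward Euler step, which gives a simple consistency check.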

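The weak formulation of (3.1a) is trivial and therefore we present only the formulations of the split generalized Stokes problem (3.1b-d), with the special treatment for the interfacial boundary conditions. Let us choose, in each of the single-phase domains $\Omega_i$, the test functions $\mathbf{v}_i$ for the velocity and the test functions $q_i$ for the pressure. Then the Galerkin formulations for (3.1b,c) and the $L^2$-projection for (3.1d) yield, respectively 

$$\int_{\Omega_i} \left[\rho_i(\mathbf{u}_i^* - \hat{\mathbf{u}}_i) \cdot \mathbf{v}_i + \tfrac{1}{2}\Delta t\,(\boldsymbol{\sigma}_i^* + \boldsymbol{\sigma}_i^n) : \nabla\mathbf{v}_i\right] d\Omega = 0, \tag{3.3a}$$

$$\int_{\Omega_i} \rho_i^{-1}\nabla\Theta_i^{n+1} \cdot \nabla q_i \, d\Omega = -\int_{\Omega_i} \nabla \cdot \mathbf{u}_i^* \, q_i \, d\Omega + \int_{\partial\Omega_i} \mathbf{n}_i \cdot \mathbf{w}_i^{n+1} q_i \, ds, \tag{3.3b}$$

$$\int_{\Omega_i} \rho_i \mathbf{w}_i^{n+1} \cdot \mathbf{v}_i \, d\Omega = \int_{\Omega_i} \nabla\Theta_i^{n+1} \cdot \mathbf{v}_i \, d\Omega. \tag{3.3c}$$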











The projection step (3.3c) makes the scheme equivalent to one of the stabilized 
schemes (at least in case of a single-phase flow) discussed by Codina (2001) and 
therefore allows for the use of an equal-order approximation for the pressure and 
velocity (see also Chen et al., 2003). 

The spatial discretization is performed on a fixed, Eulerian finite element 
grid (reference grid) which, however, is locally adapted (working grid) at each 
time step so that the interface $\Gamma$ is always aligned with element faces, i.e. it 
never crosses the interior of a finite element of the working grid at any given 
discrete time instant $t^n$. An element on the reference grid that intersects with 
the interface and is to be reconnected to construct the working grid is called a 
front element. Furthermore, the basic reference grid can be refined prior to the 
adjustment of element faces with the interface. The pressure is double valued 
in the nodes of the interface to account for its jump. We use a $P_1$ approximation 
for the velocity space on the working grid, which automatically accounts for 
the discontinuity of the first derivatives of velocity. Equation (3.1a) as well as 
the equation (2.5) for the indicator $\phi$ are discretized on the reference grid while 
(3.1b-d) are discretized on the working grid. 

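Let $\boldsymbol{\psi}_a$ be the $P_1$ shape function for the velocity, and $\varphi_a$ for the pressure, corresponding to a point $a$. The subscript $h$ denotes a discrete quantity and by definition 

$$\mu = \mu_i, \quad \rho = \rho_i, \quad \Theta = \Theta_i, \quad \mathbf{u}^* = \mathbf{u}_i^* \quad \text{and} \quad \hat{\mathbf{u}} = \hat{\mathbf{u}}_i \quad \text{in } \Omega_i \ (i = 1, 2).$$

Imposing the condition $[\mathbf{u}_h^*] = 0$, the unified weak formulation (3.3a) becomes 

$$\int_{\Omega} \left[\rho(\mathbf{u}_h^* - \hat{\mathbf{u}}_h) \cdot \boldsymbol{\psi}_a + \Delta t\, \mu\, \mathbf{e}(\mathbf{u}_h^* + \mathbf{u}_h^n) : \nabla\boldsymbol{\psi}_a\right] d\Omega = 0. \tag{3.4}$$

Here, $\mathbf{u}_h^*$ is subject to the external boundary condition $\mathbf{u}_h^*|_{\partial\Omega} = \mathbf{u}_b^{n+1} + \mathbf{w}_h^n|_{\partial\Omega}$. 

With $[\mathbf{w}_h^{n+1} \cdot \mathbf{n}] = 0$, which is a direct consequence of $[\mathbf{u}_h^*] = 0$ and $[\mathbf{u}_h^{n+1}] = 0$, the weak formulation (3.3b) reads 

$$\int_{\Omega_1} \rho_1^{-1}\nabla\Theta_{1,h}^{n+1} \cdot \nabla\varphi_a \, d\Omega + \int_{\Omega_2} \rho_2^{-1}\nabla\Theta_{2,h}^{n+1} \cdot \nabla\varphi_a \, d\Omega = -\int_{\Omega} \nabla \cdot \mathbf{u}_h^* \, \varphi_a \, d\Omega. \tag{3.5}$$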

Equation (3.5) contains two degrees of freedom for the pressure at each node that belongs to $\Gamma$. One of them can be condensed using the boundary condition 

$$\Theta_{2,h}^{n+1}|_\Gamma = \Theta_{1,h}^{n+1}|_\Gamma + [\Theta], \tag{3.6}$$

which appears to be a periodic-like boundary condition corresponding to the two fluids in $\Omega_1$ and $\Omega_2$. Here, $[\Theta]$ is given by 

$$[\Theta] = \Delta t\left(\gamma\kappa + [\rho] g z_\Gamma + \tfrac{1}{2}[\mu \nabla \cdot \mathbf{u}_h^*]\right)$$

due to (3.2). We can compute $[\Theta]$ pointwise as explained later. Let us consider one interface node, $J$, and denote by $N_J$ all the nodes linked to $J$ through the elements containing $J$, including $J$ itself. Let also $N_J^\Gamma$ be the subset of $N_J$ belonging to $\Gamma$ from $\Omega_2$. Substituting (3.6) into (3.5) for node $J$ so as to eliminate (condense) the values from $\Omega_2$, we end up with 



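$$\sum_{I \in N_J} \Theta_{I}^{n+1} \int_{\Omega} \rho^{-1}\nabla\varphi_I \cdot \nabla\varphi_J \, d\Omega = -\int_{\Omega} \nabla \cdot \mathbf{u}_h^* \, \varphi_J \, d\Omega - \sum_{K \in N_J^\Gamma} [\Theta]_K \int_{\Omega_2} \rho_2^{-1}\nabla\varphi_K \cdot \nabla\varphi_J \, d\Omega. \tag{3.7}$$

Likewise, we can obtain the discrete formulation for (3.3c) with the help of $[\mathbf{w}_h^{n+1}] = 0$: 

$$\int_{\Omega} \rho \mathbf{w}_h^{n+1} \cdot \boldsymbol{\psi}_a \, d\Omega = \int_{\Omega_1} \nabla\Theta_{1,h}^{n+1} \cdot \boldsymbol{\psi}_a \, d\Omega + \int_{\Omega_2} \nabla\Theta_{2,h}^{n+1} \cdot \boldsymbol{\psi}_a \, d\Omega,$$

of which the right-hand side can also be rewritten similar to (3.7): 

$$\int_{\Omega} \rho \mathbf{w}_h^{n+1} \cdot \boldsymbol{\psi}_J \, d\Omega = \int_{\Omega} \nabla\Theta_{1,h}^{n+1} \cdot \boldsymbol{\psi}_J \, d\Omega + \sum_{K \in N_J^\Gamma} [\Theta]_K \int_{\Omega_2} \nabla\varphi_K \cdot \boldsymbol{\psi}_J \, d\Omega. \tag{3.8}$$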

In summary, equations (3.4), (3.7) and (3.8) form the unified algebraic system 
for the generalized Stokes problem on $\Omega$. Obviously, it reduces to a conventional 
single-phase flow solver on either the reference grid or the working grid when 
$[\Theta] = 0$. For the purpose of pressure condensation, we need to evaluate the 
jump $[\Theta]$ pointwise in all the nodes $J$ on $\Gamma$. This is obtained by a localized least 
squares method as explained in Chen et al. (2003). 

Local refinement of mesh is a powerful tool for reducing the computational 
cost, especially in three dimensional and multiphase flow simulations where 
most effects are concentrated around the free interface. We applied the local 
h-refinement technique (Ruprecht and Heinrich, 1998) to the front elements and 
their immediate neighbouring elements of the basic reference mesh. It is an edge-based 
subdivision of tetrahedra, very fast for time-dependent problems if the 
permanent edge information is stored prior to the time marching. The refine- 
ment is conformal in the sense that all hanging nodes are avoided according to 
the built-in templates. The values of a mid-edge (inserted) node will be inherited 
from the previous time level if it is refined previously; otherwise, linear averaging 
is applied. 

The identification of the front elements on the h-refined reference mesh is straightforward because in such elements the level set function $\phi$ changes sign between the two vertices of an edge. Any variable $\psi$ to be interpolated at $\phi = 0$, including the position of intersection ($\mathbf{x}_\Gamma$), can be determined using the linear finite element interpolant of $\psi$ along the edge (say, $AB$): 

$$\psi_\Gamma = \psi_A + \alpha(\psi_B - \psi_A),$$

where $\alpha = \phi_A/(\phi_A - \phi_B) \in [0, 1]$. It introduces dynamically new degrees of freedom to the system for both the velocity and pressure along the interface $\Gamma$. Generally, the interface (a plane in a linear tetrahedron) will cut the element, as shown in Fig. 1, in two configurations. In case (a) the tetrahedron is cut into one tetrahedron and one prism, while in case (b) it is cut into two prisms. Each of the prisms is further subdivided into three tetrahedra by trying to use the better diagonal combinations of the quadrilateral faces without violating the connectivity rules. Since these are the most common situations, the algorithm is dramatically simplified if the intersection is not allowed to coincide with a vertex of the front element, so that these are the only two possible situations. This restriction is enforced in the following manner. If the distance between $\mathbf{x}_\Gamma$ on edge $AB$ and its closest vertex ($A$ or $B$) is smaller than a preset minimum allowable distance, then this point is fictitiously moved away from this vertex (without resetting $\phi$) to this distance along the edge. This is done by limiting $\alpha \in [\delta, 1 - \delta]$ where $\delta$ is a small positive number less than one half. This deviation introduces an $O(h^2)$ error in the approximation of the zero-th level set. This is consistent with the overall accuracy of the method. $\delta = 0.01$ for all the examples presented here. The choice of $\delta$ can affect the quality of the grid and therefore different values may be advisable for other physical situations. 

Fig. 1. Reconnection of the front elements (pressure is double valued at intersection points 5, 6, 7, 8). The prism is further subdivided into three tetrahedra by linking the shorter diagonals of the quadrilateral faces. 
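The edge intersection and the vertex-avoidance rule can be sketched as follows. This is only an illustration under stated assumptions: the function name `edge_intersection` and its argument layout are hypothetical, and the default value 0.01 for the limiting parameter is the one quoted in the text.

```python
import numpy as np

def edge_intersection(x_a, x_b, phi_a, phi_b, delta=0.01):
    """Locate the zero level set on edge AB by linear interpolation.

    Returns (alpha, x_gamma) with alpha limited to [delta, 1 - delta], so the
    intersection never coincides with a vertex (hypothetical helper).
    """
    if phi_a * phi_b >= 0.0:
        raise ValueError("the level set function does not change sign on this edge")
    alpha = phi_a / (phi_a - phi_b)              # zero of the linear interpolant
    alpha = min(max(alpha, delta), 1.0 - delta)  # move the point away from both vertices
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    return alpha, x_a + alpha * (x_b - x_a)

def interpolate_on_interface(psi_a, psi_b, alpha):
    """Interpolate any nodal variable to the intersection point."""
    return psi_a + alpha * (psi_b - psi_a)

# Assumed example: the interface passes very close to vertex A.
alpha, x_gamma = edge_intersection([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], 1e-6, -1.0)
```

In the example the raw interpolation weight is about 1e-6, so the intersection is moved to the limiting position along the edge while the level set values themselves are left unchanged.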

4 Numerical Examples 

The present technique was quantitatively verified using the problem for bubble 
oscillation in a low-viscosity and initially quiescent liquid at zero gravity. We 
performed three tests starting with an initially ellipsoidal bubble ($x^2/a^2 + y^2/b^2 + z^2/c^2 = 1$, 
$a = b = 0.55$, $c = 0.4132$) whose volume is equal to that of a sphere 
of radius $R = 0.5$. The problem was solved in 1/8 of the domain ($0 < x < 1$, 
$0 < y < 1$, $0 < z < 1$) by imposing symmetry conditions on the coordinate planes. The 
basic reference grid used for these simulations was a uniform grid containing $10^3$ 
uniform cubes, each subdivided into 6 $P_1$ tetrahedra. The dynamic h-refinement 
gives an effective resolution of $21^3$ nodes. The densities were chosen to be $\rho_1 = 0.001$, 
$\rho_2 = 1$, and the viscosities, $\mu_1 = 0.0002$ and $\mu_2 = 0.01$. The surface 
tension coefficients in the three cases were $\gamma = 1.0$, 0.5 and 0.2 correspondingly. 
The computed periods of oscillation are 0.81, 1.15 and 1.83 correspondingly, and 
they agree well with the classical potential flow results of Lamb (1932): 0.79, 
1.11 and 1.76. 
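The quoted potential flow periods can be checked with a few lines of code. The sketch below uses Lamb's classical inviscid formula for small oscillations of a sphere of one fluid immersed in another, for the lowest shape mode n = 2. Which fluid is inside, and the choice of mode, are assumptions of this example rather than statements from the text; with the denser fluid (density 1) inside a sphere of radius 0.5, the formula gives periods of about 0.79, 1.11 and 1.76.

```python
import math

def lamb_period(gamma, radius, rho_in, rho_out, n=2):
    """Period of the n-th shape mode from Lamb's inviscid two-fluid formula.

    omega^2 = n (n+1) (n-1) (n+2) gamma / (((n+1) rho_in + n rho_out) R^3)
    """
    omega_sq = (n * (n + 1) * (n - 1) * (n + 2) * gamma
                / (((n + 1) * rho_in + n * rho_out) * radius**3))
    return 2.0 * math.pi / math.sqrt(omega_sq)

# Assumed assignment of the two densities (denser fluid inside the sphere).
periods = [lamb_period(g, 0.5, rho_in=1.0, rho_out=0.001) for g in (1.0, 0.5, 0.2)]
```

The computed periods of 0.81, 1.15 and 1.83 reported above then differ from these reference values by roughly 3 to 4 percent.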

The next test case is about the Rayleigh-Taylor instability in a three-dimensional rectangular box with a square horizontal cross section. The height-width aspect ratio was fixed at $H = 4W$. In order to compare our results with those of He et al. (1999), the surface tension was neglected and the viscosities were chosen to be the same for both the heavy and the light fluids. We solve the 1/4 domain symmetric with respect to the $x = 0$ and $y = 0$ planes. Periodic boundary conditions were applied at the four vertical sides, while no-slip boundary conditions were applied at the top and bottom walls. The instability develops from the imposed single-mode initial perturbation 

$$h(x, y) = 0.05\left[\cos(2\pi x/W) + \cos(2\pi y/W)\right]$$

at 3/4 of the height above the bottom. The parameter setting is $\rho_1 = 3$, $\rho_2 = 1$, $\mu_1 = \mu_2 = 0.001$, $\gamma = 0$ and $g = 1$, which gives effectively an Atwood number $At = (\rho_1 - \rho_2)/(\rho_1 + \rho_2) = 0.5$, a Reynolds number $Re = \rho_2 (gL)^{1/2} W/\mu_2 = 1000$ and a Froude number $Fr = 1$. The domain was resolved with the basic reference grid of 16 × 16 × 128 nodes in right-angle tetrahedra. The effective local (around the interface) resolution is doubled after the one-level h-refinement. The interface 
evolution as the fluids penetrate each other is presented in Fig. 2. It is seen that 
the heavy fluid falls down in a relatively narrow “spike” forming a mushroom- 
shaped end while the lighter fluid rises upward as a large “bubble” . The existence 
of four “saddle points” at the middle of the vertical sides distinguishes it from a 
two-dimensional Rayleigh-Taylor instability. In Fig. 3 we present the time evo- 
lutions of the spike tip, bubble front and the saddle point. The data correlate 
favourably with the higher-resolution (128 x 128 x 512, full domain) parallel 
computations of He et al. (1999) using the lattice Boltzmann method. 




Fig. 2. Evolution of the fluid interface due to the Rayleigh-Taylor instability from a 
single-mode perturbation (At = 0.5, Re = 1000, Fr = 1). 

Fig. 3. Positions of the bubble front, spike tip and saddle point versus time (At = 
0.5, Re = 1000, Fr = 1). 

Abstract. The aim of this contribution is to propose and analyze some 
computational means for the approximate solution of mathematical problems 
appearing in some recent studies devoted to biological and chemical networks. 



Basic mathematical tools for investigating some biological and chemical networks 
as presented in [7,8] are recalled in Section 1. In Section 2 some variants of 
iterative Schwarz-like methods studied in [1,11] are described and applied to 
problems discussed in Section 1. An analysis and comparison of two particular 
Schwarz-like methods is presented in Section 4. 

1 Cooperative Systems 

In [7] a theory for linear problems of the type 

$$\frac{d}{dt}w(t) = Tw(t), \quad w(0) \text{ given}, \tag{1}$$

where $T$ is a given infinitesimal generator of a semigroup of operators is developed. Several examples mainly from biology and chemistry where the state vectors $w(t)$ of the underlying chemical network follow an evolution (1) are presented there. In these cases the conservation of matter requires the existence of an element $f$ such that the pairing $[w(t), f]$ is constant during all times of the evolution so that we have 

$$[w(t), f] = [w(0), f], \quad t > 0. \tag{2}$$

More complicated networks (and some examples are shown in [8]) are described by a state vector $u(t)$ which is formed by finitely many substates in the fashion 

$$u(t) = (u^1(t), \ldots, u^N(t)) \tag{3}$$

following an evolution 

$$\begin{cases} \dfrac{d}{dt}u^j(t) = T^{(j)}(u(t))\,u^j(t) := B^{(j)}u^j(t) + G^{(j)}(u(t))\,u^j(t), \\[4pt] [u^j(t), f^j] = [u^j(0), f^j], \quad t > 0, \quad j = 1, \ldots, N. \end{cases} \tag{4}$$

Hence, the subsystems evolve like (1) and (2) for which we developed the theory in [7]. The dependence of the operators $T^{(j)}$ on the data is typically on the complete state (3) rather than only on some substates. 

If $X^j$, $j = 1, \ldots, N$, denote the spaces for the states of the subsystems, the product $X = X^1 \times X^2 \times \cdots \times X^N$ can be formed and the "block"-diagonal operators 

$$B = \mathrm{diag}\{B^{(1)}, \ldots, B^{(N)}\}, \quad G(u) = \mathrm{diag}\{G^{(1)}(u), \ldots, G^{(N)}(u)\} \tag{5}$$

can be defined on $X$. Now (3) evolves according to 

$$\frac{d}{dt}u(t) = Bu(t) + G(u(t))\,u(t), \quad [u(t), f] = [u(0), f], \quad t > 0, \tag{6}$$

where $f = (f^1, \ldots, f^N) \in X$ and $B$ is the infinitesimal generator of a semigroup of operators of class $C_0$ [9, p. 321]. Note that structurally (6) is very similar to (1), (2). Only the infinitesimal generator $G = G(u)$ itself depends on the total state (3) so that our problem becomes nonlinear. 

In [8] the evolution problem was assumed in the form (6) under very general 
conditions and an existence theorem (concerning mild solutions which in our 
applications become classical solutions) was proven for all times t > 0 and the 
question of its long run behaviour settled: In fact, any solution of (6) settles 
in the long run at a steady state. Defining equations to determine this steady 
state are given there. This theorem is important not only in its own right. It also 
provides the basis for singular perturbation techniques on such systems to obtain 
analytic expressions which characterize the speed of reaction systems, and we 
refer to [4], [17] for the pseudo-steady state process leading to the definition of 
the speed of the underlying chemical networks. Note that the theory does not 
need the diagonal form of (5). However, it relies heavily on (2). Relation (6) is 
only an easy way to fit (3), (4) under the general pattern (10). The original form 
of the examples is (4). 

As in the earlier papers [7], [8] we need the theoretical basis for infinitesimal 
generators with monotonicity properties. The corresponding basic definitions are 
summarized in this section. 

Let $\mathcal{E}$ be a Banach space over the field of real numbers. Let $\mathcal{E}'$ denote the dual space of $\mathcal{E}$. Let $\mathcal{F}$, $\mathcal{F}'$ be the corresponding complex extensions of $\mathcal{E}$, $\mathcal{E}'$ respectively and let $B(\mathcal{E})$ and $B(\mathcal{F})$ be the spaces of all bounded linear operators mapping $\mathcal{E}$ into $\mathcal{E}$ and $\mathcal{F}$ into $\mathcal{F}$ respectively. In fact, we are going to carry out our investigations in a Hilbert space $\mathcal{E}$ equipped with an inner product $[\cdot, \cdot]$. 

Let $\mathcal{K} \subset \mathcal{E}$ be a closed normal and generating cone, i.e. let 

(i) $\mathcal{K} + \mathcal{K} \subset \mathcal{K}$, 

(ii) $a\mathcal{K} \subset \mathcal{K}$ for $a \in \mathbb{R}_+$, 

(iii) $\mathcal{K} \cap (-\mathcal{K}) = \{0\}$, 

(iv) $\overline{\mathcal{K}} = \mathcal{K}$, where $\overline{\mathcal{K}}$ denotes the norm-closure of $\mathcal{K}$, 

(v) $\mathcal{E} = \mathcal{K} - \mathcal{K}$, 

and 

(vi) there exists a $\delta > 0$ such that $\|x + y\| \geq \delta\|x\|$, whenever $x, y \in \mathcal{K}$. 

Property (vi) is called normality of $\mathcal{K}$. 

We let 

$$x \leq y \ \text{ or equivalently } \ y \geq x \iff (y - x) \in \mathcal{K}.$$

(vii) For every pair $x, y \in \mathcal{K}$ there exist $x \wedge y = \inf\{x, y\}$ and $x \vee y = \sup\{x, y\}$ as elements of $\mathcal{K}$. 

A cone $\mathcal{K}$ satisfying condition (vii) is called a lattice cone and the partial order on $\mathcal{E}$ a lattice order. In the terminology of H.H. Schaefer [14] $\mathcal{E}$ is called a Banach lattice. Our theory is free of hypothesis (vii). 

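Let 

$$\mathcal{K}' = \{x' \in \mathcal{E}' : x'(x) \geq 0 \text{ for all } x \in \mathcal{K}\}$$

and 

$$\mathcal{K}^d = \{x \in \mathcal{K} : x'(x) > 0 \text{ for all } 0 \neq x' \in \mathcal{K}'\}.$$

We call $\mathcal{K}'$ the dual cone of $\mathcal{K}$ and $\mathcal{K}^d$ the dual interior of $\mathcal{K}$, respectively. If $\mathcal{E}$ happens to be a Hilbert space then $\mathcal{K}'$ is replaced by $\mathcal{K}^*$, a representation of $\mathcal{K}'$ in the sense of natural isomorphism of the dual $\mathcal{E}'$ with $\mathcal{E}$. 

In the following analysis we assume that the dual interior $\mathcal{K}^d$ is nonempty. 

A set $\mathcal{H}' \subset \mathcal{K}'$ is called $\mathcal{K}$-total if the following implication holds 

$$x'(x) \geq 0 \ \ \forall x' \in \mathcal{H}' \implies x \in \mathcal{K}.$$

A linear form $x' \in \mathcal{K}'$ is called strictly positive, if $x'(x) > 0$ for all $x \in \mathcal{K}$, $x \neq 0$. 

We write $[x, x']$ in place of $x'(x)$, where $x \in \mathcal{E}$ and $x' \in \mathcal{E}'$ respectively. If $\mathcal{E}$ happens to be a Hilbert space then $[x, x']$ denotes the appropriate inner product. 

A bounded linear operator $T \in B(\mathcal{E})$ is called $\mathcal{K}$-nonnegative if $T\mathcal{K} \subset \mathcal{K}$. We write in this case $T \geq O$ and equivalently $O \leq T$. If $T$ and $S$ both in $B(\mathcal{E})$ satisfy $(S - T)\mathcal{K} \subset \mathcal{K}$ we write $T \leq S$ or equivalently $S \geq T$. 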

If $T \in B(\mathcal{E})$ then $T'$ denotes its dual and hence, $T' \in B(\mathcal{E}')$. In case $\mathcal{E}$ is a Hilbert space, the dual operator $T'$ is to be replaced by the adjoint operator $T^*$ defined via relations $[Tx, y] = [x, T^*y]$ valid for all $x$ in the domain of $T$ and $y$ in the domain of $T^*$. 

Definition 1. Operator $T \in B(\mathcal{E})$ is called $\mathcal{K}$-stochastic if there exists a vector $x' \in \mathcal{K}'$ such that for the dual map $T'$ the following relation 

$$T'x' = x'$$

holds. We also say that $T$ is a transition operator of a Markov chain or process and that operator $T$ corresponds to vector $x'$. If $\mathcal{E}$ is a Hilbert space the dual $T'$ is to be replaced by its adjoint $T^*$. 

Definition 2. A bounded linear operator $B$ is $\mathcal{K}$-irreducible if for every pair of elements $0 \neq x \in \mathcal{K}$, $0 \neq x' \in \mathcal{K}'$, there is a positive integer $p = p(x, x')$ such that $x'(B^p x) > 0$. This implies that in the Markov chain each state has access to every other state, i.e., the chain is ergodic [15]. The Perron-Frobenius theorem states that for $B \geq O$ irreducible, $\rho(B)$ is an isolated eigenvalue, and the corresponding eigenvector is positive; see, e.g., [3]. 

Let $T \in B(\mathcal{F})$ and let $\sigma(T)$ denote its spectrum. Further, let $T \in B(\mathcal{E})$. We introduce the operator $\tilde{T}$ by setting $\tilde{T}z = Tx + iTy$, where $z = x + iy$, $x, y \in \mathcal{E}$, and call it the complex extension of $T$. By definition, we let $\sigma(T) := \sigma(\tilde{T})$. Similarly, we let $r(T) := r(\tilde{T})$, where $r(T) = \max\{|\mu| : \mu \in \sigma(T)\}$ denotes the spectral radius of $T$. 

In order to simplify notation we will identify $T$ and its complex extension and will thus omit the tilde sign denoting the complex extension. 

The set 

$$\sigma_\pi(T) = \{\mu \in \sigma(T) : |\mu| = r(T)\}$$

is called the peripheral spectrum of $T$. Note that $\sigma_\pi(T)$ is never empty. 

If $\mu$ is an isolated singularity of $R(\lambda, T) = (\lambda I - T)^{-1}$ we have the following Laurent expansion of $R(\lambda, T)$ around $\mu$ [16], [12] 

$$R(\lambda, T) = \sum_{k=0}^{\infty} A_k(\mu)(\lambda - \mu)^k + \sum_{k=1}^{\infty} B_k(\mu)(\lambda - \mu)^{-k}, \tag{7}$$

where $A_{k-1}$ and $B_k$, $k = 1, 2, \ldots$, belong to $B(\mathcal{F})$. Moreover, it holds [16] 

$$B_1(\mu) = \frac{1}{2\pi i}\int_{C_0} (\lambda I - T)^{-1}\, d\lambda, \tag{8}$$

where 

$$C_0 = \{\lambda : |\lambda - \mu| = \rho_0\}$$

and $\rho_0$ is such that $\{\lambda : |\lambda - \mu| \leq \rho_0\} \cap \sigma(T) = \{\mu\}$. Furthermore, 

$$B_{k+1}(\mu) = (T - \mu I)B_k(\mu), \quad k = 1, 2, \ldots \tag{9}$$

If there is a positive integer $q = q(\mu)$ such that 

$$B_q \neq 0, \quad \text{and} \quad B_k = 0 \ \text{ for } k > q,$$

then $\mu$ is called a pole of the resolvent operator and $q$ its multiplicity. 

We define the symbol 

$$\mathrm{ind}(\mu I - T) = q(\mu)$$

and call it the index of $T$ at $\mu$. In particular, we call $\mathrm{ind}(T)$ the index of $T$ instead of index of $T$ at 0. 

The motivating examples are particular cases of Problem (P) defined as fol- 
lows: 

Problem (P) To find $\mathcal{K}$-positive solutions of 

$$\frac{d}{dt}u(t) = Bu(t) + G(u(t))\,u(t) = T(u(t))\,u(t), \quad u(0) = u_0, \tag{10}$$

where $B$ is generally an unbounded linear densely defined operator and $G(u)$ for every $u \in \mathcal{E}$ is a bounded linear map on $\mathcal{E}$, where $\mathcal{E}$ denotes the underlying space to be specified in each particular situation. We assume that we can identify situations in which Problem (P) as formulated above possesses solutions and we are aware of conditions guaranteeing their existence as well as some of their properties such as uniqueness, asymptotic behaviour etc. A rather typical representative of such a problem is described in our study [8]. Since our aim in the present contribution is to propose some algorithms of computational nature and analyse their properties, we do not go into much detail and refer the reader to [8] for the general aspects. All properties needed for a good understanding of the numerical processes studied will be presented here. 

2 Schwarz Iterative Methods 

In this section we present some notations, definitions, and preliminaries. Anal- 
ogous concepts on nonnegative matrices (defined here for generally infinite di- 
mensional spaces) can be found in the standard reference [3]. 

By $\sigma(C)$ we denote the spectrum of $C$ and by $r(C)$ its spectral radius. By $\mathcal{R}(C)$ and $\mathcal{N}(C)$ we denote the range and null space of $C$, respectively. 

Let $\lambda \in \sigma(C)$ be a pole of the resolvent operator $R(\mu, C) = (\mu I - C)^{-1}$. The multiplicity of $\lambda$ as a pole of $R(\mu, C)$ is called the index of $C$ with respect to $\lambda$ and denoted $\mathrm{ind}_\lambda C$. Equivalently, $q = \mathrm{ind}_\lambda C$ if it is the smallest integer for which $\mathcal{R}((\lambda I - C)^{q+1}) = \mathcal{R}((\lambda I - C)^q)$. This happens if and only if $\mathcal{R}((\lambda I - C)^q) \oplus \mathcal{N}((\lambda I - C)^q) = \mathcal{E}$. 

Definition 3. Let $A$ be a densely defined ($\mathcal{D}(A)$ being its domain of definition) and spectrally bounded from above linear operator and let $a \in \mathbb{R}^1$ be its bound, i.e., $\lambda \in \sigma(A)$ implies $\mathrm{Re}\,\lambda \leq a$. Assume that there exists a system of operators $\{A_h : 0 < h < h_0\}$, $A_h \in B(\mathcal{E})$, such that relation 

$$\lim_{h \to 0} \|A_h x - Ax\| = 0 \tag{11}$$

holds for every $x \in \mathcal{D}(A)$. Operator $A$ is called an approximate $\mathcal{K}$-M-operator if each of the operators $A_h$ in the collection mentioned has the form $A_h = bI - B_h$ with $b > r(B_h)$ and each $B_h$ being $\mathcal{K}$-nonnegative. A pair of operators $(M, W)$ is called a splitting of $A$ if $A = M - W$ and $M^{-1}$ exists as a bounded linear operator on $\mathcal{E}$. A splitting of an operator $A$ is called of $\mathcal{K}$-nonnegative type if the operator $T = M^{-1}W$ is $\mathcal{K}$-nonnegative [10]. If, in particular, both operators $M^{-1}$ and $W$ are $\mathcal{K}$-nonnegative, the splitting is called regular [18]. If $M^{-1}$ and $T = M^{-1}W$ are nonnegative, the splitting is called weak regular [13]. 

Note that a weak regular splitting does not explicitly require any conditions upon part $W$ of the splitting of $A$. 
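In finite dimensions, with the cone taken to be the nonnegative orthant (an assumption of this illustration), the splitting types defined above can be checked numerically. The sketch below tests the Jacobi splitting of a small tridiagonal M-matrix; the helper names are hypothetical.

```python
import numpy as np

def is_regular_splitting(M, W, tol=1e-12):
    """Regular: both M^{-1} and W are entrywise nonnegative."""
    M_inv = np.linalg.inv(M)
    return bool((M_inv >= -tol).all() and (W >= -tol).all())

def is_weak_regular_splitting(M, W, tol=1e-12):
    """Weak regular: both M^{-1} and T = M^{-1} W are entrywise nonnegative."""
    M_inv = np.linalg.inv(M)
    return bool((M_inv >= -tol).all() and (M_inv @ W >= -tol).all())

# Assumed example: tridiagonal M-matrix and its Jacobi splitting A = M - W.
A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
M = np.diag(np.diag(A))
W = M - A
T = np.linalg.solve(M, W)                  # iteration operator M^{-1} W
rho = max(abs(np.linalg.eigvals(T)))       # spectral radius of T
```

For this matrix the splitting is regular, hence also weak regular, and the spectral radius of the iteration operator equals cos(π/4), so the associated stationary iteration converges.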

Let $T$ be a bounded linear operator. $T$ is called convergent if $\lim_{k \to \infty} T^k$ exists and zero-convergent, if moreover $\lim_{k \to \infty} T^k = O$. Standard stationary iterations of the form 

$$x^{k+1} = Tx^k + c, \quad k = 0, 1, \ldots, \tag{12}$$

converge if and only if either $T$ is zero-convergent or, if $\rho(T) = 1$, $T$ is convergent. A bounded linear operator $T$ with unit spectral radius being an isolated pole of the resolvent operator is convergent if the following two conditions hold: 

(i) if $\lambda \in \sigma(T)$ and $\lambda \neq 1$, then $|\lambda| < 1$. 

(ii) $\mathrm{ind}_1 T = 1$. 

Equivalent conditions for (ii) can be found in [3]. 

It is useful to write $T = Q + S$, where $Q$ is the first term of the Laurent expansion of $T$, i.e., the eigenprojection onto the invariant subspace corresponding to $\lambda = 1$; see, e.g., [16]. Then $Q^2 = Q$, $QS = SQ = O$, and $1 \notin \sigma(S)$. This is called the spectral decomposition of $T$. The condition (i) above is equivalent to having $\rho(S) < 1$. 
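The spectral decomposition can be made concrete on a small matrix. The sketch below takes a 2 × 2 column-stochastic matrix (an assumed example, not one from the text), builds the eigenprojection for the eigenvalue 1 from the corresponding right and left eigenvectors, and forms the remainder as the difference.

```python
import numpy as np

# Assumed example: column-stochastic matrix with eigenvalues 1 and 0.7.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# Right and left eigenvectors belonging to the eigenvalue 1.
vals, vecs = np.linalg.eig(T)
v = np.real(vecs[:, np.argmin(abs(vals - 1.0))])
vals_left, vecs_left = np.linalg.eig(T.T)
w = np.real(vecs_left[:, np.argmin(abs(vals_left - 1.0))])

Q = np.outer(v, w) / (w @ v)               # eigenprojection for the eigenvalue 1
S = T - Q                                  # remainder of the decomposition
rho_S = max(abs(np.linalg.eigvals(S)))     # spectral radius of the remainder
T_limit = np.linalg.matrix_power(T, 200)   # approximates the limit of the powers
```

Here the projection is idempotent, its products with the remainder vanish, the remainder has spectral radius 0.7, and the powers of the matrix approach the projection, which is exactly the behaviour of a convergent operator.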

We state a very useful Lemma; its proof can be found, e.g., in [5]. We note 
that when r{T) = 1, this lemma can be used to show condition (ii) above. To 
prove convergence one needs to show in addition that condition (i) also holds. 

Lemma 1. Let $T$ be a $\mathcal{K}$-nonnegative bounded linear operator such that $Tv \leq \alpha v$ with $v > 0$. Then $r(T) \leq \alpha$. If furthermore $r(T) = \alpha$ is a pole of the resolvent operator $(\lambda I - T)^{-1}$, then $\mathrm{ind}_\alpha T = 1$. 

3 Algebraic Formulation of Schwarz Methods 

In this Section we want to generalize convergence of some procedures well known as algebraic Schwarz iteration techniques [1], [11]; our aim is however to analyze solving equations in infinite dimensional spaces as well. Given an initial approximation $x^0$ to the solution of 

$$Ax = b, \quad b \in \mathcal{E}, \tag{13}$$

the (one-level) multiplicative Schwarz method can be written as the stationary iteration 

$$x^{k+1} = Tx^k + c, \tag{14}$$

where 

$$T = T_\mu = (I - P_p)(I - P_{p-1}) \cdots (I - P_1) = \prod_{i=p}^{1}(I - P_i) \tag{15}$$

and $c$ is a certain vector in $\mathcal{E}$. Here 

$$P_i = R_i^*(R_i A R_i^*)^{-1} R_i A, \tag{16}$$

where $R_i$ is a suitable linear operator and $R_i^*$ its adjoint with respect to the inner product in the Hilbert space $\mathcal{E}$. Note that each $P_i$, and hence each $I - P_i$, is a projection operator; i.e., $(I - P_i)^2 = I - P_i$. Each $I - P_i$ naturally has spectral radius equal to 1. 

The additive Schwarz method for the solution of (13) is of the form (14), where 

$$T = T_\theta = I - \theta\sum_{i=1}^{p} P_i = I - \theta\sum_{i=1}^{p} R_i^T A_i^{-1} R_i A, \tag{17}$$

where $0 < \theta < 1$ is a damping parameter. 

The operator $R_i$ corresponds to the restriction operator from the whole space to a subset of the state space (usually of finite dimension $n_j$, $j = 1, \ldots, p$; the dimension of the range $\mathcal{R}(R_0)$ is infinite in general) in the domain decomposition setting, and the operator $A_i = R_i A R_i^T$ is the restriction of $A$ to that subset. A solution using $A_i$ is called a local solver as in the domain decomposition method as well as in the algebraic case. 
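A finite-dimensional sketch of the two iteration operators may help fix ideas. It assumes a small tridiagonal matrix, two overlapping index sets and a damping parameter of 0.5, none of which come from the text; the restriction operators are built from rows of the identity matrix, and the helper names are hypothetical.

```python
import numpy as np

def restriction(rows, n):
    """Restriction operator made of selected rows of the n x n identity matrix."""
    return np.eye(n)[rows, :]

def schwarz_projection(A, R):
    """P = R^T (R A R^T)^{-1} R A, with the local solve done by np.linalg.solve."""
    A_local = R @ A @ R.T
    return R.T @ np.linalg.solve(A_local, R @ A)

n = 6
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # assumed model matrix
R_list = [restriction([0, 1, 2, 3], n), restriction([2, 3, 4, 5], n)]
P_list = [schwarz_projection(A, R) for R in R_list]

# Multiplicative Schwarz: (I - P_p) ... (I - P_1), applied from right to left.
T_mult = np.eye(n)
for P in P_list:
    T_mult = (np.eye(n) - P) @ T_mult

# Additive Schwarz with an assumed damping parameter theta = 0.5.
theta = 0.5
T_add = np.eye(n) - theta * sum(P_list)

rho_mult = max(abs(np.linalg.eigvals(T_mult)))
rho_add = max(abs(np.linalg.eigvals(T_add)))
```

For this nonsingular example both spectral radii are below one, so either operator yields a convergent stationary iteration of the form (14).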

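We assume that our standard choice of operators $R_j$ follows the same idea as does the choice of the rows of $R_i$ as rows of the $n \times n$ identity matrix $I$ in case of $\mathcal{E} = \mathbb{R}^n$, e.g., 

$$R_i = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \end{pmatrix}.$$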

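Let us assume that the Hilbert space $\mathcal{E}$ in which we carry out our analyses is partially ordered by a closed normal cone generating $\mathcal{E}$, i.e. $\mathcal{E} = \mathcal{K} - \mathcal{K}$. Moreover, let subsets $\mathcal{H}'_j \subset \mathcal{K}'$, $j = 1, \ldots, p$, exist such that 

$$\mathcal{H}' = \bigcup_{j=1}^{p} \mathcal{H}'_j$$

is $\mathcal{K}$-total and $x \in \mathcal{E}_j$ holds if and only if there exists an element $x'_j \in \mathcal{H}'_j$ such that $x'_j(x) \neq 0$. In order to allow overlaps of individual reduction maps $R_j$ we do not assume that $\mathcal{E}_j \cap \mathcal{E}_k = \{0\}$, $j \neq k$, $\mathcal{E}_j = \mathrm{range}(P_j)$ and $\mathcal{E}_k = \mathrm{range}(P_k)$. Obviously, $0 \neq x \in \mathcal{E}_j \cap \mathcal{E}_k$ if and only if there are $x' \in \mathcal{H}'_j$ and $y' \in \mathcal{H}'_k$ such that $x'(x)y'(x) \neq 0$ and $\mathcal{E}_j \cap \mathcal{E}_k = \{0\}$ if and only if $x'(x)y'(x) = 0$ whenever $x' \in \mathcal{H}'_j$, $y' \in \mathcal{H}'_k$ and $x \in \mathcal{E}_j \cup \mathcal{E}_k$. 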

Formally, space $\mathcal{E}$ can be considered as a direct sum 

$$\mathcal{E} = \mathcal{E}_j \oplus \mathcal{E}_{\neg j},$$

where $\mathcal{E}_{\neg j}$ is the "complementary" subspace to $\mathcal{E}_j$. We also have a corresponding decomposition of operator $A$ given by formula 

$$A = \begin{pmatrix} A_j & \ast \\ \ast & A_{\neg j} \end{pmatrix},$$

where $A_j$ maps $\mathcal{E}_j \cap \mathcal{D}(A)$ into $\mathcal{E}_j$ and $A_{\neg j}$ maps $\mathcal{E}_{\neg j} \cap \mathcal{D}(A)$ into $\mathcal{E}_{\neg j}$. 

We assume that the maps $E_j$ are diagonal operators, i.e. 

$$\mu_j I \leq E_j \leq \nu_j I \tag{18}$$

for some real $\mu_j, \nu_j$, $j = 1, \ldots, p$. In fact, we expect that similarly as in the finite dimensional case operators $E_j$ will signal whether the method chosen does possess some overlaps and how extensive they are. 

It is easy to see that as in the finite dimensional situation both $A_i$ and $A_{\neg i}$ are $\mathcal{K}$-M-operators [3]. For each $i = 1, \ldots, p$, we construct diagonal operators $E_i \in B(\mathcal{E})$ associated with $R_i$ chosen 

$$E_i = R_i^T R_i. \tag{19}$$



If A is an /C — M-operator for each i = l,...,p, we construct a second 
collection of operators Mi associated with Ri as follows 



M, = 



A^ O 
O ’ 



( 20 ) 



where 

D^i > O (21) 

is invertible diagonal operator representing a “diagonal” of A. 

The following result comes from [1]. 

Proposition 1. Let A be a nonsingular JC — M-operator. Let Mi be defined as 
in (20). Then the splittings A = Mi — Wi are regular (and thus weak regular and 
of nonnegative type). 

In the cases considered in this paper, we always have that Mi defined in 
(20) are nonsingular. With the definitions (19) and (20) we obtain the following 
equality 

E,M-^ = RjA-^R,, i = l,...,p. (22) 

We can thus rewrite (15) as 

T=T^ = {L- EpM-^A){L - Ep_iM~\A) •••(/- E^Mf^A). (23) 
Similarly, (17) can be rewritten as 



p 

T = Tg = I -e^E,M~^A. 
2=1 



( 24 ) 
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This is how we interpret the multiplicative and additive Schwarz iterations. 

In [1] it was shown that when $A$ is nonsingular, $r(T_\mu) < 1$, and thus the
method (12) is convergent. Furthermore, there exists a unique splitting $A =
M - W$ such that $T = T_\mu = M^{-1}W$. This splitting is a weak regular splitting.

In this paper we want to explore the convergence of (12), using the iterations 
defined by (23), (24), when A is singular. 



4 Two Approaches to Approximate Solving 
Stationary Equation 

We are going to describe two ways of constructing solutions to stationary
equations. The first one is based on an application of one of the Schwarz algorithms
directly to the stationary equation; the second one utilizes the asymptotic behaviour
of solutions of the appropriate evolution. The second alternative is suitable in
particular for problems where no additional information, such as irreducibility,
is available.

As classical properties of M-matrices suggest, strictly positive diagonals may
strongly influence the convergence and the speed of convergence of the investigated
iterative processes. We recall a useful

Lemma 2. Let $T \in B(\mathcal{E})$ satisfy $T\mathcal{K} \subset \mathcal{K}$ and $T \ge \alpha I$ with some real $\alpha > 0$.
Then $S = (1/r(T))\,T$ is convergent.

Proof. Obviously, $r(T) \ge \alpha$ and hence $r(S) = 1$. To complete the proof we
need to show that 1 is the unique spectral point of $S$ with modulus $r(S)$. Let $T = \alpha I + V$.
By hypothesis, $V\mathcal{K} \subset \mathcal{K}$ and any $\lambda \in \sigma(T)$ can be written as $\lambda = \alpha + \mu$, $\mu \in \sigma(V)$.
Hence, $r(T) = \alpha + r(V)$. Let $\mu = c + di$ with reals $c, d$ and $i^2 = -1$, $|\mu|^2 = r(V)^2$.
It follows that

$$\frac{|\lambda|^2}{(\alpha + r(V))^2} = \frac{\alpha^2 + 2\alpha c + r(V)^2}{\alpha^2 + 2\alpha r(V) + r(V)^2} = 1$$

only if $c = r(V)$ or, equivalently, if $\mu$ is real and hence $\lambda$ positive. This completes
the proof of the Lemma.
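
A finite-dimensional illustration of Lemma 2 may help. The sketch below is not taken from the paper: the 2×2 matrices, the value of `alpha`, and the helper names `matmul` and `power` are assumptions made only for this example. The permutation matrix `P` is nonnegative with spectral radius 1 but has no positive diagonal, and its powers oscillate; the shifted matrix `T = alpha*I + (1 - alpha)*P` dominates `alpha*I`, also has spectral radius 1, and its powers converge.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def power(a, k):
    """k-th power of a square matrix by repeated multiplication."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, a)
    return result


# Nonnegative, r(P) = 1, zero diagonal: the powers alternate between I and P.
P = [[0.0, 1.0], [1.0, 0.0]]

# Illustrative shift (assumed value): T >= alpha * I entrywise, r(T) = 1.
alpha = 0.25
T = [[alpha * (1.0 if i == j else 0.0) + (1.0 - alpha) * P[i][j]
      for j in range(2)] for i in range(2)]

if __name__ == "__main__":
    print("P^10 =", power(P, 10))
    print("P^11 =", power(P, 11))
    print("T^60 =", power(T, 60))
```

Here the second eigenvalue of `T` is `2*alpha - 1`, whose modulus is below 1, so the powers of `T` approach the projection onto the eigenvector for the eigenvalue 1.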

In [8] it has been shown that a solution $u = u(t)$ to Problem (P) becomes
stationary and $u(+\infty)$ is a solution to

$$B(\alpha u(+\infty) + \beta u_0) + G(u(+\infty))\,u(+\infty) = 0, \qquad (25)$$

where $\alpha$ and $\beta$ are suitable nonnegative reals and $u_0$ is the initial condition in (10).
Since, by hypothesis,



-Bh = Ch- cl, G{u)\h = F{u)\h - 

$C_h\mathcal{K} \subset \mathcal{K}$, $F(u)|_h\mathcal{K} \subset \mathcal{K}$, $c \ge r(C_h)$, $f(u) \ge r(F(u)|_h)$,
where $C_h$ and $F(u)|_h$ denote the appropriate discretizations of $C$ and $F(u)$
respectively, it is an easy matter to show that

, {Ch + F{u))\f,)=T{u)\f, (26) 

r[Ch) + r{H(u)\h) 



satisfies 

T{u)‘;^xl=xl r G/C*. 

Thus, in view of Definition 3, T{u) is /C-stochastic corresponding to vector x* G 

Theorem 1. Let $\mathcal{E}$ be a Hilbert space with an inner product. Let $A = I - B$,
where $B \in B(\mathcal{E})$ is a $\mathcal{K}$-stochastic operator such that $Bv = v$ with $v \in \mathcal{K}$. Let
$p > 1$ be a positive integer and $A = M_i - W_i$ be splittings of nonnegative type
such that the diagonals of $T_i = M_i^{-1}W_i$, $i = 0, 1, \dots, p$, are positive. Then

$$T = T_\mu = (I - E_p M_p^{-1}A)(I - E_{p-1}M_{p-1}^{-1}A) \cdots (I - E_1 M_1^{-1}A)$$

and $\tilde{T} = T_p \cdots T_1$ are convergent operators. Furthermore, there is a splitting of
nonnegative type

$$A = M - W \qquad (27)$$

such that $T = M^{-1}W$, and the iteration operator $T$ possesses the following
properties:

$$T = Q + S, \quad Q^2 = Q, \quad QS = SQ = O, \quad r(S) < 1, \qquad (28)$$

and

$$AQ = O. \qquad (29)$$

The existence of a splitting of nonnegative type, and properties (28) and (29)
also hold for $\tilde{T}$.

Proof. We begin with the operator $\tilde{T}$. Let $v > 0$ be such that $Bv = v$, i.e.,
$Av = 0$. For each of the splittings $A = M_i - W_i$, we then have $M_i v = W_i v$.

This implies that $\tilde{T}v = v$, and by Lemma 1 we have that $r(\tilde{T}) = 1$ and that
the index is 1. To show that $\tilde{T}$ is convergent, we show that $\tilde{T} \ge \alpha I$ for some
real $\alpha > 0$. This follows from the fact that each of the operators $T_i$ satisfies
the relation $T_i \ge \alpha_i I$ with positive reals $\alpha_1, \dots, \alpha_p$. We follow a similar logic for the
multiplicative Schwarz iteration operator (23). Since $Av = 0$, $Tv = v$, and thus
$r(T) = 1$ and $\operatorname{ind}_1 T = 1$. Each factor in (23) can be written as

$$I - E_i + E_i(I - M_i^{-1}A) = I - E_i + E_i M_i^{-1}W_i,$$

and since $O \le E_i \le I$ and $M_i^{-1}W_i \ge O$, each factor is nonnegative. For a row
in which $E_i$ is zero, the diagonal entry in this factor has value one. For a row in
which $E_i$ has value one, the diagonal entry in this factor is the positive diagonal
entry of $M_i^{-1}W_i$. Thus, again, we have a finite product of $\mathcal{K}$-nonnegative operators,
each dominating a positive multiple of the identity operator, implying
that the product $T$ also dominates a positive multiple of the identity operator,
and therefore it is convergent.

The rest of the proof applies equally to $T$ and $\tilde{T}$; we only detail it for $T$.
The operator $T$ being convergent implies the spectral decomposition (28), where
$Q$ is the spectral projection onto the eigenspace of $T$ corresponding to $r(T) = 1$.
Furthermore, since $T \ge O$, $Q = \lim_{k \to \infty} T^k \ge O$.

We show now that $\mathcal{N}(I - T) = \mathcal{N}(A)$. According to the construction of $T$, the
null spaces satisfy $\mathcal{N}(A) \subset \mathcal{N}(I - T)$. Any element $y \in \mathcal{N}(I - T)$ which
does not belong to $\mathcal{N}(A)$ has to have the form $y = Ax$ for some $x$, with $y \neq 0$.
Since $Q \ge O$, we have that $y \ge 0$. On the other hand $y^T e = x^T A^* e = 0$, a
contradiction. Since we then have that $\mathcal{N}(I - T) = \mathcal{N}(A)$, the existence of a
splitting of the form (27) follows in the same way as the finite dimensional analog
of Theorem 1 follows from Theorem 2.1 of [2]. The fact that $T \ge O$ indicates that this
splitting is of nonnegative type.

With this splitting, using (28), the identity $AQ = M(I - T)Q = O$ holds,
so we also have (29).

An example of splittings that lead to iteration matrices satisfying the
hypotheses of Theorem 1 is described in the following proposition, which requires no
proof. It provides a possible modification to the local solvers when the iteration
operator defined by (20) does not dominate a positive multiple of the identity
operator.

Proposition 2. Let $B \ge O$, $B^*x' = x'$. Let $\alpha_1, \dots, \alpha_p$ be any positive real
numbers. Let the splittings $A = I - B = M_i - W_i$, $i = 0, \dots, p$, be defined by

$$M_i = \begin{pmatrix} \alpha_i I + A_i & O \\ O & \alpha_i I + D_{\neg i} \end{pmatrix} \qquad (30)$$

and $W_i = M_i - A$, where $D_{\neg i}$ are defined in (20)-(21). Then the splittings are
regular, and the iteration operators $T_i = M_i^{-1}W_i$, $i = 0, \dots, p$, dominate a positive
multiple of the identity operator.



Let us recall that we want to compute the stationary state, i.e. a solution of
the system (25). This problem can be reformulated in terms of a new iterative
process with the generating nonnegative operators $C$ and $F(u(t))$:

u ((fc -k l)r)) = {Ch + F{u{kT))\hU {{k + l)r)) , u(0) = uq. (31) 



An alternative to the method of approximately solving the stationary equation of
Problem (P) described in Theorem 1 is to compute the limit $\lim_{k \to \infty} u(k\tau)$
using process (31).

From a variety of possible choices of rational approximations of the exponential
we choose one from the class of limited Padé approximations, say

$$R(z) = \frac{P_j(z)}{(1 - \gamma z)^q}, \quad q \ge 2,$$

with appropriate real $\gamma$ and polynomial $P_j$ of degree $j$. Denoting

= c + ^f{u) ^ 

we want to compute according to (31) 

(/ - -fT{Lk)\hY u{{k + fc = 0, 1, . . . (32) 

The above process can be implemented as follows. Let us omit the discretiza- 
tion parameter index and set = u{kT) and 

yk+i/q = (/ — yrLfe) u{kr), = (j _ 7rLfc)‘^“^ u{kr), /c = 0, 1, . . . 
Then 

(7 - yrLfc) = Pj{TLk)v^{kT), 

(7- 7rLfc)r;''+^/'J(fcr) = Pj(rLfe)n''+^/‘?(fcr), 



(7 - 7rifc) = P, (rLfe)z;'=+(«-i)/«(fcr). 

Convergence of the method just described is an easy consequence of the fact
that the operators $\{I - \gamma\tau L_k|_h\}$, $k = 1, 2, \dots$, are nonsingular $\mathcal{K}$-M-operators
in the spirit of Definition 3, and of the convergence results of [1]. Actually, we have

Theorem 2. Let $\mathcal{E}$ be a Hilbert space over the reals generated by a closed normal
cone $\mathcal{K}$. Assume $B$ is a generally unbounded linear operator defined on a dense
domain $\mathcal{D} \subset \mathcal{E}$ and its adjoint satisfies the relation $B^*x^* = 0$. We assume further
that $B$ generates a semigroup of operators of class $C_0$ such that $T(t; -B)\mathcal{K} \subset
\mathcal{K}$, $t \ge 0$. Finally assume that for any $u \in \mathcal{D} \cap \mathcal{K}$, the operator $G(u) \in B(\mathcal{E})$ is a
$\mathcal{K}$-M-operator satisfying $[G(u)]^*x^* = 0$.

Then the iteration process (32) returns a sequence $\{u(k\tau)\}$ of approximations
to a unique solution to Problem (P) such that

$$\lim_{k \to \infty} u(k\tau) = u(+\infty).$$



5 Conclusions 

Nowadays it is accepted by the community of numerical analysts that two-level
and, more generally, multi-level iterative methods offer an essentially broader variety of
tools for solving large scale computational problems. In this context Schwarz and
Schwarz-like methods play quite an important role. Many contributions by many
authors document this statement, as a rule by investigating problems characterized
by nonsingular operators. The emphasis of our approach is just the opposite,
namely on problems with singular operators. Another feature of our analysis is that
we consider the computational problems in their original form, i.e. we work with
generally infinite dimensional objects and let discretizations be made at an
appropriate moment, e.g. at each iteration step.

We apply our Schwarz-like methods to a problem coming from stochastic
modeling in biology and chemistry. We can thus profit from having nicely structured
operators, but we suffer from having to approach problems with hardly accessible data.
We thus propose two methods, each suitable in the corresponding situation. The
first method assumes all data are accessible and the second just the opposite. In
particular, the method using an auxiliary time evolution requires neither a priori
knowledge of location nor access to each single datum. As examples let us
mention irreducibility of the operators of the model and access to matrix elements
of appropriate discretizations. The latter is compensated by the ability of our
method to compute the corresponding matrix actions easily.

Abstract. Monte Carlo applications are widely perceived as computationally
intensive but naturally parallel. Therefore, they can be effectively
executed using the dynamic bag-of-work model which is well
suited to parallel, distributed, and grid-based architectures. This paper 
concentrates on providing computational infrastructure for Monte Carlo 
applications on such architectures. This is accomplished by analyzing 
the characteristics of large-scale Monte Carlo computations, and lever- 
aging the existing Scalable Parallel Random Number Generators (SPRNG) 
library. Based on these analyses, we improve the efficiency of subtask- 
scheduling by implementing and analyzing the “N-out-of-M” strategy, 
and develop a Monte Carlo-specific lightweight checkpointing technique, 
which leads to a performance improvement. Also, we enhance the trust- 
worthiness of Monte Carlo applications on these architectures by utilizing 
the statistical nature of Monte Carlo and by cryptographically validating 
intermediate results utilizing the random number generator already in 
use in the Monte Carlo application. All these techniques lead to a high- 
performance grid-computing infrastructure that is capable of providing 
trustworthy Monte Carlo computation services. 



1 Introduction 

Readers are most likely already familiar with parallel and distributed computing
architectures, so we do not elaborate on them further. Grid computing, in turn,
is characterized by the large-scale sharing and cooperation of dynamically
distributed resources, such as CPU cycles, communication bandwidth, and data,
to constitute a computational environment [1]. In the grid's dynamic environment,
from the application point of view, two issues are of prime importance:
performance (how quickly the grid-computing system can complete the submitted
tasks) and trustworthiness (whether the results obtained are, in fact, due to the
computation requested). To meet these two requirements, many grid-computing
or distributed-computing systems, such as Condor [2], HARNESS [3], Javelin [4],
Globus [5], and Entropia [7], concentrate on developing high-performance and
trusted-computing facilities through system-level approaches. In this paper, we
analyze the characteristics of Monte Carlo applications, which form a
potentially large computational category of parallel, distributed, and grid
applications, to develop approaches that address performance and trustworthiness
issues at the application level.

The remainder of this paper is organized as follows. In §2, we analyze the 
characteristics of Monte Carlo applications. We also quickly describe the ran- 
dom number generation requirements for these applications on parallel, dis- 
tributed, and grid-based architectures. We then develop a generic grid-computing 
paradigm for Monte Carlo computations. We discuss how to take advantage of 
the characteristics of Monte Carlo applications to improve the performance and 
trustworthiness of Monte Carlo grid computing in §3 and §4, respectively. Fi- 
nally, §5 summarizes our conclusions and future research directions. 



2 Grid-Based Monte Carlo Applications 



Among parallel, distributed, and grid applications, those using Monte Carlo
methods, which are widely used in scientific computing and simulation, have
often been regarded as too simple to merit special attention because of their
natural parallelism. However, below we will show that many aspects of Monte Carlo
applications can be exploited to provide much higher levels of performance and
trustworthiness for computations on these architectures. According to word of mouth,
about 50% of the CPU time used on supercomputers at the U.S. Department of Energy
National Labs is spent on Monte Carlo computations. Unlike data-intensive
applications, Monte Carlo applications are usually computation intensive [6], and
they tend to work on relatively small data sets. Parallelism is a way to accelerate
the convergence of a Monte Carlo computation. If N processors execute N
independent copies of a Monte Carlo computation, the accumulated result will
have a variance N times smaller than that of a single copy. In a distributed Monte
Carlo application, once a distributed task starts, it can usually be executed
independently with almost no interprocess communication. Therefore, Monte Carlo
applications are perceived as naturally parallel, and they can usually be
programmed via the so-called dynamic bag-of-work model. Here a large task is split
into smaller independent subtasks, and each is then executed separately.
Effectively using the dynamic bag-of-work model for Monte Carlo requires that the
underlying random number streams in each subtask be independent in a statisti- 
cal sense. The SPRNG (Scalable Parallel Random Number Generators) library [11] 
was designed to use parameterized pseudorandom number generators to provide 
independent random number streams to parallel processes. Some generators in 
SPRNG can generate up to - 1 independent random number streams with 

sufficiently long period and good quality [13]. These generators meet the
random number requirements of most Monte Carlo applications for these types of
architectures.

SPRNG was originally designed to provide independent and dynamic streams of
random numbers for massively parallel supercomputers. The design of SPRNG was
based on the needs of Monte Carlo applications such as neutronics. In neutronics, it
is commonplace to associate a single random number stream with each neutron.
When a neutron splits into more neutrons, each child neutron is then associated
with a new (and hopefully independent) random number stream. SPRNG ensures
that, when desired, such a computation performed
on a parallel or distributed machine can be entirely reproducible. The details of
how this is achieved are beyond the scope of this paper [11]. However, these capabilities
of SPRNG are extensible to distributed and grid-based architectures, and form the
underpinnings for the grid services we will discuss. In the remainder of this paper
we thus focus exclusively on grid computing issues for Monte Carlo applications.
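
The role of independent streams and the 1/N variance reduction can be illustrated with a small experiment. This is only a sketch under stated assumptions: Python's `random.Random` seeded with distinct integers stands in for SPRNG's parameterized streams (it does not offer the same independence guarantees), the estimated quantity (π/4 by hit-or-miss sampling) and the names `one_copy` and `averaged` are chosen here for illustration.

```python
import math
import random
import statistics


def one_copy(seed, samples=100):
    """One Monte Carlo copy: hit-or-miss estimate of pi/4 on its own stream."""
    rng = random.Random(seed)  # stand-in for one independent SPRNG stream
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return hits / samples


def averaged(base_seed, n_copies):
    """Accumulated result of n_copies copies, each with a distinct seed."""
    return statistics.fmean([one_copy(base_seed + i) for i in range(n_copies)])


N = 16      # number of copies (processors); assumed value
REPS = 400  # repetitions used to estimate the variances empirically

single = [one_copy(1_000_000 + r) for r in range(REPS)]
pooled = [averaged(r * N, N) for r in range(REPS)]

# Empirical variance ratio; it should be roughly N, up to sampling noise.
ratio = statistics.variance(single) / statistics.variance(pooled)

if __name__ == "__main__":
    print("variance ratio single/pooled:", ratio)
    print("pooled estimate of pi/4:", statistics.fmean(pooled), "exact:", math.pi / 4)
```

The ratio is itself a random quantity, so it only approximates N; the point of the sketch is the order of magnitude, not an exact value.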

The intrinsically parallel aspect of Monte Carlo applications makes them an 
ideal fit for the grid-computing paradigm. In general, grid-based Monte Carlo 
applications can divide the Monte Carlo task into a number of subtasks by the 
task-split service and utilize the grid’s schedule service to dispatch these in- 
dependent subtasks to different nodes [15]. The connectivity services provide 
communication facilities among nodes providing computational services. The ex- 
ecution of a subtask takes advantage of the storage service of the grid to store 
intermediate results and to store each subtask’s final (partial) result. When the 
subtasks are done, the collection service can be used to gather the results and 
generate the final result of the entire computation. 

The inherent characteristics of Monte Carlo applications motivate the use 
of grid computing to effectively perform large-scale Monte Carlo computations. 
Furthermore, within this Monte Carlo grid-computing paradigm, we can use the 
statistical nature of Monte Carlo computations and the cryptographic aspects of 
random numbers to reduce the wallclock time and to enforce the trustworthiness 
of the computation. 

3 Improving the Performance
of Grid-Based Monte Carlo Computing

3.1 The N-out-of-M Strategy 

Subtask-Scheduling Using the N-out-of-M Strategy. The nodes that provide
CPU cycles in a grid system will most likely have computational capabilities
that vary greatly. A node might be a high-end supercomputer, a low-end personal
computer, or even just an intelligent widget. In addition, these nodes are
geographically widely distributed and not centrally manageable. A node may
go down or become inaccessible without notice while it is working on its task.
Therefore, a slow node might become the bottleneck of the whole computation
if the assembly of the final result must wait for the partial result generated on
this slow node. A delayed subtask might delay the accomplishment of the whole
task, while a halted subtask might prevent the whole task from ever finishing. To
address this problem, system-level methods are used in many grid systems. For
example, Entropia [7] tracks the execution of each subtask to make sure none
of the subtasks are halted or delayed. However, the statistical nature of Monte
Carlo applications provides a shortcut to solve this problem at the application
level.

Suppose we are going to execute a Monte Carlo computation on a grid system. 
We split it into N subtasks, with each subtask based on its unique independent 
random number stream. We then schedule each subtask onto the nodes in the 
grid system. In this case, the assembly of the final result requires all the N partial 
results generated from the N subtasks. Each subtask is a “key” subtask, since 
the suspension or delay of any one of these subtasks will have a direct effect on 
the completion time of the whole task. 

When we are running Monte Carlo applications, what we really care about 
is how many random samples (random trajectories) we must obtain to achieve 
a certain, predetermined, accuracy. We do not much care which random sam- 
ple set is estimated, provided that all the random samples are independent in a 
statistical sense. The statistical nature of Monte Carlo applications allows us to 
enlarge the actual size of the computation by increasing the number of subtasks 
from N to M, where M > N . Each of these M subtasks uses its unique indepen- 
dent random number set, and we submit M instead of N subtasks to the grid 
system. Therefore, M bags of computation will be carried out and M partial 
results may be eventually generated. However, it is not necessary to wait for all 
M subtasks to finish. When N partial results are ready, we consider the whole 
task for the grid system to be completed. The application then collects the N 
partial results and produces the final result. At this point, the grid-computing 
system may broadcast abort signals to the nodes that are still computing the 
remaining subtasks. We call this scheduling strategy the N-out-of-M strategy. 
In the N-out-of-M strategy, more subtasks are scheduled than are actually needed;
therefore, none of these subtasks becomes a “key” subtask, and we can
tolerate at most M − N delayed or halted subtasks.
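
The effect of the strategy on the completion time can be sketched with order statistics. The finish times below are hypothetical values made up for illustration (with `math.inf` marking a subtask that never returns), and the function names are assumptions of this sketch, not part of any grid middleware.

```python
import math


def completion_time(finish_times, n_needed):
    """Time at which n_needed partial results are available (inf if never)."""
    ordered = sorted(finish_times)
    return ordered[n_needed - 1] if n_needed <= len(ordered) else math.inf


def selected_subtasks(finish_times, n_needed):
    """Indices of the subtasks whose results are used, in order of arrival.

    Recording these indices (and hence their random number streams) is what
    keeps the computation reproducible afterwards.
    """
    order = sorted(range(len(finish_times)), key=finish_times.__getitem__)
    return order[:n_needed]


# Hypothetical finish times for M = 6 submitted subtasks; subtask 2 is halted.
finish = [4.0, 9.0, math.inf, 5.0, 30.0, 6.0]
n = 4

t_n_out_of_m = completion_time(finish, n)       # wait for the 4 fastest of 6
t_n_out_of_n = completion_time(finish[:n], n)   # wait for all of the first 4

if __name__ == "__main__":
    print("N-out-of-M completes at", t_n_out_of_m)
    print("N-out-of-N completes at", t_n_out_of_n)
    print("subtasks used:", selected_subtasks(finish, n))
```

In this made-up instance the halted subtask blocks the N-out-of-N run forever, while the N-out-of-M run finishes as soon as the fourth result arrives.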

Also notice that the Monte Carlo computation using the N-out-of-M strategy 
is reproducible, because we know exactly which N out of M subtasks are actually 
involved and which random numbers were used. Thus each of these N subtasks 
can be reproduced later. However, if we want to reproduce all of these N subtasks 
at a later time on the grid system, the N-out-of-N strategy must be used. 

One drawback of the N-out-of-M strategy is we must execute more subtasks 
than actually needed and will therefore increase the computational workload 
on the grid system. However, our experience with distributed computing sys- 
tems such as Condor and Javelin shows that most of the time there are more 
nodes providing computing services available in the grid system than subtasks. 
Therefore, properly increasing the computational workload to achieve a shorter 
completion time for a computational task should be an acceptable tradeoff in a 
grid system. 



Analysis of the N-out-of-M Strategy. In Monte Carlo applications, N is
determined by the application and depends on the number of random samples
or random trajectories needed to obtain a predetermined accuracy. The problem
is thus how to choose the value M properly. A good choice of M can prevent 
a few subtasks from delaying or even halting the whole computation. However, 
if M is chosen too large, there may be little benefit to the computation at 
the cost of significantly increasing the workload of the grid system. In order to 
determine a proper value of M to achieve a specific performance requirement, 
we study the grid behavior and consider some system parameters. In the N- 
out-of-M strategy, the completion time of a Monte Carlo computational task 
depends on the performance of each individual node that is assigned a subtask, 
the node failure rate, and also the interconnection network failure rate. We make 
the following assumptions to set up our model: 

1. The execution of a task completely occupies a node on the grid, and no other 
jobs can be executed on the same node concurrently. 

2. Compared to the execution time, the tasks’ scheduling time and result col- 
lection time is short enough to be ignored. 

3. Each node works on its task independently. 

4. Each node has an equal probability of obtaining a task from the schedule 
service. The tasks are scheduled without noticing the performance of each 
node. 

To analyze this we establish a Petri Net (PN) model of the N-out-of-M
strategy. This PN model has M nodes in total. A node, $i$, alternates between an up
state (place $P^i_{up}$) and a down state (place $P^i_{down}$). One transition represents
node unavailability (with unavailability rate $\lambda$) and another represents the node
coming back to service (with availability rate $\mu$). A further transition is assigned
the task progress threshold W (usually 100%) so that the subtask completion condition
(token in $P_{subtask}$) is reached when W is hit. When $P_{subtask}$ gathers N tokens, transition
$t_{N\text{-out-of-}M}$ is enabled to fire, and a token in $P_{complete}$ indicates the completion of
the Monte Carlo task.

We also establish a simpler binomial model for the subtask-scheduling scheme 
using the N-out-of-M strategy based on the above PN model. Assume that the 
probability of a subtask completing by time t is given by p(t). The function p(t) 
describes the aggregate probability over the pool of nodes in the grid. In a real- 
life grid system, p(t) could be measured by computing the empirical frequencies 
of completion times over the pool. In this paper, we model p(t) based on an
analytic probability distribution function. 

Let $S$ be the total number of nodes available in the grid system, and let
$p^i_{up}$ be the probability that node $i$ participating in the computations is up,
where $p^i_{up} = \mu/(\mu + \lambda)$.

Let $\theta'_i$ be the service rate of node $i$, which can be measured as the number of
tasks that can be finished within a specific period of time without interruption.
Considering node availability, the actual service rate, $\theta_i$, of node $i$ is $\theta_i = \theta'_i \, p^i_{up}$.

At time $t$, the probability that a Monte Carlo subtask will be done on node
$i$ is $1 - e^{-\theta_i t}$. Since each node has equal probability to be scheduled a subtask,
$p(t)$ can be represented as

$$p(t) = \frac{1}{S}\sum_{i=1}^{S}\left(1 - e^{-\theta_i t}\right) = 1 - \frac{1}{S}\sum_{i=1}^{S} e^{-\theta_i t}. \qquad (1)$$

If the service rates $\theta_1, \theta_2, \dots, \theta_S$ conform to a distribution with probability
density function $f(\theta)$, $p(t)$ can thus be written as

$$p(t) = 1 - \int_0^{L} f(\theta)\, e^{-\theta t}\, d\theta. \qquad (2)$$

Here $L$ is the maximum value of $\theta_i$ in the computation.

Typically, if all of the nodes have the same service rate $\theta$, $p(t)$ can be simplified
to

$$p(t) = 1 - e^{-\theta t}. \qquad (3)$$

Then, the probability that exactly N out of M subtasks are complete at time t
is given by

$$P_{\text{Exactly-}N\text{-out-of-}M}(t) = \binom{M}{N}\, p(t)^N \left(1 - p(t)\right)^{M-N}. \qquad (4)$$



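We can approximate $P_{\text{Exactly-}N\text{-out-of-}M}(t)$ using a Poisson distribution with
$\lambda = M\,p(t)$. Then, $P_{\text{Exactly-}N\text{-out-of-}M}(t)$ can be approximated as

$$P_{\text{Exactly-}N\text{-out-of-}M}(t) \approx \frac{\lambda^N}{N!}\, e^{-\lambda}. \qquad (5)$$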



The probability that at least N subtasks are complete is thus given by

$$P_{N\text{-out-of-}M}(t) = \sum_{i=N}^{M} \binom{M}{i}\, p(t)^i \left(1 - p(t)\right)^{M-i}. \qquad (6)$$

The old strategy can be thought of as “N-out-of-N”, which has probability given
by

$$P_{N\text{-out-of-}N}(t) = p^N(t). \qquad (7)$$
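
The binomial model can be evaluated directly. The following sketch assumes the exponential completion model with a common service rate and uses function names (`p_done`, `prob_exactly`, `prob_at_least`) introduced here only for illustration; the numerical values in the usage part are likewise assumptions.

```python
import math


def p_done(theta, t):
    """Probability that one subtask has finished by time t (common rate theta)."""
    return 1.0 - math.exp(-theta * t)


def prob_exactly(n, m, p):
    """Binomial probability that exactly n of m subtasks are complete."""
    return math.comb(m, n) * p ** n * (1.0 - p) ** (m - n)


def prob_at_least(n, m, p):
    """Probability that at least n of m subtasks are complete."""
    return sum(prob_exactly(i, m, p) for i in range(n, m + 1))


if __name__ == "__main__":
    theta, t = 0.005, 300.0   # assumed service rate and observation time
    p = p_done(theta, t)
    print("p(t)         =", p)
    print("10-out-of-10 =", prob_at_least(10, 10, p))
    print("10-out-of-20 =", prob_at_least(10, 20, p))
```

With M = N the sum collapses to the single term p(t)^N, which is the N-out-of-N case.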

Now the question is to decide on a reasonable value for M to satisfy a
required task completion probability $\alpha$ (when N subtasks are complete on the
grid). Unfortunately, it is hard to express M explicitly in analytic form. However,
we can use a numerical method, which increases M by 1 at a time and evaluates
$P_{N\text{-out-of-}M}(t)$ until its value is greater than $\alpha$. This
empirically gives us the minimum value of M. An alternative approach to estimating
M/N is to use a normal distribution to approximate the underlying binomial.
When $M(1 - p(t)) \ge 5$ and $M p(t) \ge 5$, the binomial distribution can be
approximated by a normal curve with mean $m = M p(t)$ and standard deviation
$\sigma = \sqrt{M p(t)(1 - p(t))}$. Then, we can find the minimum value M that satisfies

$$\Phi\left(\frac{M - m}{\sigma}\right) - \Phi\left(\frac{N - m}{\sigma}\right) \ge \alpha, \qquad (8)$$

where $\Phi$ is the normal cumulative distribution function.

Fig. 1. Simulations and Model Prediction of the N-out-of-M Scheduling Strategy for
Grid Monte Carlo Applications.

In a grid system, nodes providing computational services join and leave dy- 
namically. Some nodes are considered to be “transient” nodes, which provide 
computational services temporarily and may depart from the system perma- 
nently. A subtask submitted to a “transient” node may have no chance of being 
finished. Suppose the fraction of “transient” nodes in a grid is $\beta$; then we need
to enlarge M to $\lceil M/(1 - \beta) \rceil$ to tolerate these never-finished subtasks.
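
The search for M described above can be sketched as follows. This is a minimal illustration, not the authors' program: the function names and the `m_max` safeguard are assumptions, and the normal-approximation criterion is evaluated here as the probability that a Normal(m, σ) variable lies between N and M, which is one reading of the approach.

```python
import math
from statistics import NormalDist


def prob_at_least(n, m, p):
    """Binomial probability that at least n of m subtasks are complete."""
    return sum(math.comb(m, i) * p ** i * (1.0 - p) ** (m - i)
               for i in range(n, m + 1))


def min_m_exact(n, p, alpha, m_max=10_000):
    """Smallest M >= N whose completion probability reaches alpha."""
    for m in range(n, m_max + 1):
        if prob_at_least(n, m, p) >= alpha:
            return m
    raise ValueError("alpha not reached for M <= m_max")


def min_m_normal(n, p, alpha, m_max=10_000):
    """Same search using the normal approximation (requires 0 < p < 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    phi = NormalDist().cdf
    for m in range(n, m_max + 1):
        mean = m * p
        sigma = math.sqrt(m * p * (1.0 - p))
        if phi((m - mean) / sigma) - phi((n - mean) / sigma) >= alpha:
            return m
    raise ValueError("alpha not reached for M <= m_max")


def allow_for_transient(m, beta):
    """Enlarge M to ceil(M / (1 - beta)) for a fraction beta of transient nodes."""
    return math.ceil(m / (1.0 - beta))


if __name__ == "__main__":
    n, p, alpha = 10, 0.6, 0.95   # assumed example values
    m = min_m_exact(n, p, alpha)
    print("exact search:        M =", m)
    print("normal approximation M =", min_m_normal(n, p, alpha))
    print("with 25% transient nodes: M =", allow_for_transient(m, 0.25))
```

The normal approximation is only meaningful where the stated rule of thumb (both M p(t) and M (1 − p(t)) at least 5) holds, so small cases should use the exact search.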



Simulation of the N-out-of-M Strategy. In our simulation program of the 
N-out-of-M strategy, we simulated a 1,000-node computational grid. Nodes join 
and leave the system with a specified probability. Also, nodes have a variety 
of computational capabilities. Each simulation is run for 1,000 time steps. (A
task running on a node with service rate $\theta$ will take $1/\theta$ time steps; e.g., a fast
node with service rate 0.01 will take 100 time steps to complete the task while
a slow one with service rate 0.001 will take 1,000.) At each time step, a certain
number of nodes go down while a certain number of nodes become available for 
computation. We built our simulations in order to 

1. evaluate the validity of our model, and to 

2. compare the performance of the N-out-of-M strategy in grid systems with 
different configurations. 
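
A much simplified version of such a simulation is sketched below. It is not the authors' simulation program: the per-step up/down probabilities, the spread of the service rates, the floor on the rates, and all function names are assumptions chosen only to make the example self-contained.

```python
import random


def simulate_completion_times(m, steps=1000, mean_rate=0.005, sd_rate=0.001,
                              p_down=0.01, p_up=0.1, seed=0):
    """Simulate m subtasks, one per node; return each finish step (or None).

    Each node has a normally distributed service rate (floored at a small
    positive value), toggles between up and down states at every time step,
    and advances its subtask by its service rate while it is up.
    """
    rng = random.Random(seed)
    rates = [max(1e-4, rng.gauss(mean_rate, sd_rate)) for _ in range(m)]
    up = [True] * m
    progress = [0.0] * m
    done_at = [None] * m
    for step in range(1, steps + 1):
        for i in range(m):
            if done_at[i] is not None:
                continue
            if up[i]:
                if rng.random() < p_down:
                    up[i] = False
            elif rng.random() < p_up:
                up[i] = True
            if up[i]:
                progress[i] += rates[i]
                if progress[i] >= 1.0:
                    done_at[i] = step
    return done_at


def task_time(done_at, n):
    """Step at which n subtasks have finished, or None if that never happens."""
    finished = sorted(t for t in done_at if t is not None)
    return finished[n - 1] if len(finished) >= n else None


if __name__ == "__main__":
    times = simulate_completion_times(m=20, seed=1)
    print("10-out-of-20 finishes at step", task_time(times, 10))
    print("10-out-of-10 finishes at step", task_time(times[:10], 10))
```

Comparing `task_time(times, n)` with `task_time(times[:n], n)` on the same run contrasts the N-out-of-M and N-out-of-N strategies on identical node behaviour.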

Figure 1 shows our simulation results and model prediction of the N-out-of-M
strategy for grid Monte Carlo applications. Our analytical model matches the
simulation results quite well. Also, we find that with a proper choice of
M (20 in the graph), the Monte Carlo task completion time can be improved
significantly over the N-out-of-N strategy. However, if we enlarge M too much,
the workload of the system increases without significantly reducing the Monte
Carlo task completion time. Also, we notice that as time goes on, the N-out-of-M
strategy always has a higher probability of completion than the N-out-of-N
strategy, although both converge to probability one at large times.

Fig. 2. Simulations of the N-out-of-M Strategy on a Grid System with Node Service
Rates Normally Distributed (Mean=0.005, Variance=0.001).

Figures 2 and 3 show the simulation results of the N-out-of-M strategy in
different grid systems. Both simulated grid systems assume that the service rates
$\theta$ of nodes are normally distributed with the same mean (0.005) but different
variances (0.001 in Figure 2 and 0.003 in Figure 3). Figure 2 simulates a grid
comprised of nodes with similar performance characteristics. This can be a grid
constructed from computers in a computer lab that have similar performance
parameters. On the other hand, Figure 3 is the simulation of a grid whose nodes
have computational capabilities in a wide range. In practice, this grid can be a
system with geographically widely distributed nodes like SETI@home [9], where a
node might be a high-end supercomputer or a low-end personal computer. From
the graphs, we see that the N-out-of-M scheduling strategy improves the Monte
Carlo task completion time in both grid systems; however, we gain a more significant
improvement in the system comprised of nodes with service rates having a
large variance. This experimental result indicates that the N-out-of-M strategy
is more effective in a grid system where an individual node's performance varies
greatly. More interestingly, the simulation results also show that, in both grid
systems, with a sufficiently large value of M, the time after which the
Monte Carlo task is complete with high probability is close to 200 time steps,
which is exactly the subtask completion time for a single node with the mean
(0.005) service rate. Therefore, we can expect that, with a proper number of
subtasks scheduled using the N-out-of-M strategy, the Monte Carlo task completion
time on a grid can be made almost the same as the subtask completion
time on a node with average computational capability.

Fig. 3. Simulations of the N-out-of-M Strategy on a Grid System with Node Service
Rates Normally Distributed (Mean=0.005, Variance=0.003).



Lightweight Checkpointing. A subtask running on a node of a grid system 
may take a very long time to finish. The N-out-of-M strategy is an attempt to 
mitigate the effect of this on the overall running time. However, if checkpointing 
is incorporated, one can directly attack reducing the completion time of the 
subtasks. Some grid computing systems implement a process-level checkpoint. 
Condor, for example, takes a snapshot of the process’s current state, including 
stack and data segments, shared library code, process address space, all CPU 
states, states of all open files, all signal handlers, and pending signals [12]. On 
recovery, the process reads the checkpoint file and then restores its state. Since 
the process state contains a large amount of data, processing such a checkpoint is 
quite costly. Also, process-level checkpointing is very platform-dependent, which
limits the possibility of migrating the process-level checkpoint to another node 
in a heterogeneous grid-computing environment. 

Fortunately, Monte Carlo applications have a structure highly amenable to 
application-based checkpointing. Typically, a Monte Carlo application starts in 
an initial configuration, evaluates a random sample or a random trajectory, es- 
timates a result, accumulates means and variances with previous results, and 
repeats this process until some termination condition is met. Although differ- 
ent Monte Carlo applications may have very different implementations, many 
of them can be developed or adjusted in a typical programming structure that 
consists of initialization, followed by a loop for generating statistical samples, 
and ending with the summation of the overall statistics. 
Thus, to recover an interrupted computation, a Monte Carlo application
needs to save only a relatively small amount of information. The necessary in- 
formation to reconstruct a Monte Carlo computation image at checkpoint time 
will be the current results based on the estimates obtained so far, the current 
status and parameters of the random number generators, and other relevant pro- 
gram information like the current iteration number. This allows one to make a 
smart and quick application checkpoint in most Monte Carlo applications. Using 
XML [8] to record the checkpointing information, we can make this checkpoint 
platform-independent. More importantly, compared to a process checkpoint, the 
application-level checkpoint is much smaller in size and much quicker to generate. 
Therefore, it should be relatively easy to migrate a Monte Carlo computation 
from one node to another in a grid system. With the application-level checkpoint- 
ing and recovery facilities, the typical Monte Carlo application’s programming 
structure can be amended as described above. However, the implementation of
application-level checkpointing will somewhat increase the complexity of devel-
oping new Monte Carlo grid applications.
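
The checkpoint contents listed above (current results, generator state, iteration number) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy estimator of pi, the function names `run`, `save_checkpoint` and `load_checkpoint`, the XML element names, and the use of Python's built-in generator rather than the library used by the authors.

```python
import ast
import random
import xml.etree.ElementTree as ET


def run(iterations, rng, hits=0, done=0):
    """Continue a toy pi estimator from a saved tally (hypothetical example)."""
    for _ in range(iterations):
        x, y = rng.random(), rng.random()
        hits += x * x + y * y <= 1.0   # count points inside the quarter circle
        done += 1
    return hits, done


def save_checkpoint(rng, hits, done):
    """Serialize the tally, the iteration count and the generator state to XML text."""
    root = ET.Element("checkpoint")
    ET.SubElement(root, "iteration").text = str(done)
    ET.SubElement(root, "hits").text = str(hits)
    # The generator state is a nested tuple of integers, so its repr is portable text.
    ET.SubElement(root, "rng_state").text = repr(rng.getstate())
    return ET.tostring(root, encoding="unicode")


def load_checkpoint(xml_text):
    """Rebuild the generator and the tallies from the XML text."""
    root = ET.fromstring(xml_text)
    rng = random.Random()
    rng.setstate(ast.literal_eval(root.findtext("rng_state")))
    return rng, int(root.findtext("hits")), int(root.findtext("iteration"))
```

Because the XML text holds everything the loop needs, a computation restored on another node continues with exactly the random numbers the interrupted run would have used.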



4 Enhancing the Trustworthiness of Grid-Based Monte Carlo Computing

4.1 Distributed Monte Carlo Partial Result Validation 

The correctness and accuracy of grid-based computations are vitally important 
to an application. In a grid-computing environment, the service providers of 
the grid are often geographically separated with no central management. Faults 
may hurt the integrity of a computation. These might include faults arising 
from the network, system software or node hardware. A node providing CPU 
cycles might not be trustworthy. A user might provide a system to the grid 
without the intent of faithfully executing the applications obtained. Experience 
with SETI@home has shown that users often fake computations and return wrong
or inaccurate results. The resources in a grid system are so widely distributed 
that it appears difficult for a grid-computing system to completely prevent all 
“bad” nodes from participating in a grid computation. Unfortunately, Monte 
Carlo applications are very sensitive to each partial result generated from each 
subtask. An erroneous partial result will most likely lead to the corruption of 
the whole grid computation and thus render it useless. 

To enforce the correctness of the computation, many distributed computing
or grid systems adopt fault-tolerant methods, like duplicate checking [10] and
majority vote [16]. In these approaches, subtasks are duplicated and carried out
on different nodes. Erroneous partial results can be found by comparing the
partial results of the same subtask executed on different nodes. Duplicate checking
requires doubling the computation to discover an erroneous partial result. Majority
vote requires at least three times as much computation to identify an erroneous
partial result. Using duplicate checking or majority vote will significantly increase
the workload of a grid system.
In the dynamic bag-of-work model as applied to Monte Carlo applications,
each subtask works on the same description of the problem but estimates based
on different random samples. Since the mean in a Monte Carlo computation is
accumulated from many samples, its distribution will be approximately normal
by the Central Limit Theorem. Suppose $f_1, \ldots, f_i, \ldots, f_n$ are the n
partial results generated from individual nodes on a grid system. The mean of
these partial results is

$$\hat{f} = \frac{1}{n} \sum_{i=1}^{n} f_i \qquad (9)$$

and we can estimate its standard error, s, via the following formula

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (f_i - \hat{f})^2} \qquad (10)$$

Specifically, the Central Limit Theorem states that $\hat{f}$ should be distributed
approximately as a Student-t random variable with mean $\hat{f}$, standard deviation
$s/\sqrt{n}$, and n degrees-of-freedom. However, since n, the number of subtasks, is
often large, we may instead approximate the Student-t distribution with the
normal. Standard normal confidence interval theory states that the exact mean
is within 1 standard deviation of $\hat{f}$ with 68% confidence, within 2 standard
deviations with 95% confidence, and within 3 standard deviations with about
99.7% confidence. This statistical property of Monte Carlo computation can be
used to develop an approach for validating the partial results of a large
grid-based Monte Carlo computation.

Here is the proposed method for distributed Monte Carlo partial result
validation. Suppose we are running n Monte Carlo subtasks on the grid, and the ith
subtask returns partial result $f_i$. We anticipate that the $f_i$ are approximately
normally distributed with mean $\hat{f}$ and standard deviation $\sigma = s/\sqrt{n}$. We
expect about one of the $f_i$ in this group of n to lie outside a normal confidence
interval with confidence $1 - 1/n$. To choose a confidence level that permits
the events we expect to see, statistically, yet still flags outliers, we
choose a multiplier, c, so that we flag events that should occur only once in a
group of size cn. The choice of c is rather subjective, but c = 10 implies that in
only 1 in 10 runs of size n should we expect to find an outlier with confidence
$1 - 1/(10n)$. With a given choice of c, one computes the symmetric normal
confidence interval based on a confidence of $\alpha = 1 - 1/(cn)$. Thus the confidence
interval is $[\hat{f} - Z_{\alpha/2}\sigma,\ \hat{f} + Z_{\alpha/2}\sigma]$, where $Z_{\alpha/2}$ is the unit normal
value such that

$$\int_0^{Z_{\alpha/2}} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx = \frac{\alpha}{2}.$$

If $f_i$ is in this confidence interval, we can consider this
partial result as trustworthy. However, if $f_i$ falls outside the interval, which may
happen merely by chance with a very small probability, this particular partial
result is suspect.

There are two possibilities for a partial result, $f_i$, to fall outside the confidence
interval. These are

1. errors occur during the computation of this subtask, or

2. a rare event with very low probability is captured.

In the former case, this partial result is erroneous and should be discarded, whereas
in the latter case, we need to take it into consideration. To distinguish these two
cases, we can rerun the particular subtask that generated the suspicious partial
result on a trusted node for further validation.

This Monte Carlo partial result validation method supplies us with a way to 
identify suspicious results without running more subtasks. This method assumes 
that the majority of the nodes in a grid system are “good” service providers, which
can correctly and faithfully execute their assigned task and transfer the result. 
If most of the nodes are malicious, this validation method may not be effective. 
However, experience has shown that the fraction of “bad” nodes in volunteered 
computation is very small. 
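
The validation rule of this section can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `flag_suspects` and its parameters are hypothetical, and by default the spread used for the interval is the sample standard deviation of the partial results from (10), which is an assumption of the sketch; a different value of the standard deviation can be supplied through `sigma`.

```python
from statistics import NormalDist, mean, stdev


def flag_suspects(partials, c=10, sigma=None):
    """Return indices of partial results outside the symmetric normal interval.

    The confidence is alpha = 1 - 1/(c*n); `sigma` defaults to the sample
    standard deviation of the partial results (an assumption of this sketch).
    """
    n = len(partials)
    centre = mean(partials)
    if sigma is None:
        sigma = stdev(partials)
    alpha = 1.0 - 1.0 / (c * n)
    # Z_{alpha/2}: the unit normal value with probability alpha/2 between 0 and Z.
    z = NormalDist().inv_cdf(0.5 + alpha / 2.0)
    low, high = centre - z * sigma, centre + z * sigma
    return [i for i, f in enumerate(partials) if not low <= f <= high]
```

Indices returned by `flag_suspects` are only candidates for rerunning on a trusted node; the function cannot tell a faulty result from a genuine rare event.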

4.2 Intermediate Value Checking 

Usually, a grid-computing system compensates the service providers to
encourage computer owners to supply resources. Many Internet-wide grid-computing
projects, such as SETI@home [9], have the experience that some service providers
don’t faithfully execute their assigned subtasks. Instead they attempt to
provide bogus partial results at a much lower personal computational cost in order
to obtain more benefits. Checking whether the subtask assigned to a service
provider is faithfully carried out and accurately executed is a critical issue that
must be addressed by a grid-computing system.

One approach to checking the validity of a subtask computation is to validate
intermediate values within the computation. Intermediate values are quantities
generated within the execution of the subtask. To the node that runs the
subtask, these values will be unknown until the subtask is actually executed and
reaches a specific point within the program. On the other hand, to the clever
application owner, certain intermediate values are either pre-known and secret or
are very easy to generate. Therefore, by comparing the intermediate values with
the pre-known values, we can check whether the subtask is actually faithfully
carried out or not. Monte Carlo applications consume pseudorandom numbers,
which are generated deterministically from a pseudorandom number generator.
If this pseudorandom number generator has a cheap algorithm for computing
an arbitrary element of its sequence within the period, the random numbers are
perfect candidates to be these cleverly chosen intermediate values. Thus, we
have a very simple strategy to validate a result from subtasks by tracing certain
predetermined random numbers in Monte Carlo applications.

For example, in a grid Monte Carlo application, we might force each subtask
to save the value of the current pseudorandom number after every N (e.g., N =
100,000) pseudorandom numbers are generated. Therefore, we can keep a record
of the Nth, 2Nth, . . . , kNth random numbers used in the subtask. To validate
the actual execution of a subtask on the server side, we can just recompute the
Nth, 2Nth, . . . , kNth random numbers applying the specific generator with
the same seed and parameters as used in this subtask. We then simply match
them. A mismatch indicates problems during the execution of the task. Also,
we can use intermediate values of the computation along with random numbers 
to create a cryptographic digest of the computation in order to make it even 
harder to fake a computational result. Given our list of random numbers, or 
a deterministic way to produce such a list, when those random numbers are 
computed, we can save some piece of program data current at that time into an 
array. At the same time we can use that random number to encrypt the saved 
data and incorporate these encrypted values in a cryptographic digest of the 
entire computation. At the end of the computation the digest and the saved 
values are then both returned to the server. The server, through cryptographic 
exchange, can recover the list of encrypted program data and quickly compute 
the random numbers used to encrypt them. Thus, the server can decrypt the 
list and compare it to the “plaintext” versions of the same transmitted from the 
application. Any discrepancies would flag either an erroneous or faked result. 
While this technique is certainly not a perfect way to ensure correctness and 
trustworthiness, a user determined on faking results would have to scrupulously 
analyze the code to determine the technique being used, and would have to know 
enough about the mathematics of the random number generator to leap ahead as 
required. In our estimation, surmounting these difficulties would far surpass the 
amount of work saved by gaining the ability to pass off faked results as genuine. 
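
The basic check of the Nth, 2Nth, ..., kNth random numbers can be sketched as follows. The sketch assumes a 64-bit linear congruential generator with Knuth's MMIX constants, chosen here only because its leap-ahead is easy to write down; it is not the generator used by the authors, and the names `step`, `leap`, `run_subtask` and `verify` are hypothetical. The cryptographic digest described above is not included.

```python
A = 6364136223846793005   # multiplier (Knuth's MMIX linear congruential generator)
C = 1442695040888963407   # increment
M = 2 ** 64               # modulus


def step(x):
    """Advance the generator by one number."""
    return (A * x + C) % M


def leap(x, k):
    """Return the state k steps after x using O(log k) multiplications."""
    acc_a, acc_c = 1, 0    # accumulated affine map, starts as the identity
    cur_a, cur_c = A, C    # affine map for 1, 2, 4, ... steps
    while k:
        if k & 1:
            acc_a, acc_c = (cur_a * acc_a) % M, (cur_a * acc_c + cur_c) % M
        cur_a, cur_c = (cur_a * cur_a) % M, ((cur_a + 1) * cur_c) % M
        k >>= 1
    return (acc_a * x + acc_c) % M


def run_subtask(seed, total, every):
    """Worker side: consume `total` numbers and record every `every`-th one."""
    x, trace = seed, []
    for count in range(1, total + 1):
        x = step(x)
        if count % every == 0:
            trace.append(x)
    return trace


def verify(seed, every, trace):
    """Server side: recompute the recorded numbers by leaping ahead and compare."""
    return all(leap(seed, (k + 1) * every) == value
               for k, value in enumerate(trace))
```

The server never replays the subtask: each recorded value costs it only a logarithmic number of multiplications, while a node that skipped the work has no cheap way to produce the trace unless it knows the leap-ahead rule.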

5 Conclusions 

Monte Carlo applications generically exhibit naturally parallel and computation- 
ally intensive characteristics. Moreover, we can easily fit the dynamic bag-of-work 
model, which works so well for Monte Carlo applications, onto a grid system to 
implement large-scale grid-based Monte Carlo computing. Furthermore, based 
on the analysis of grid-based Monte Carlo applications, we may take advantage 
of the statistical nature of Monte Carlo calculations and the cryptographic na- 
ture of random numbers to enhance the performance and trustworthiness of this 
Monte Carlo grid-computing infrastructure at the application level. 

We are developing a Grid-Computing Infrastructure for Monte Carlo Appli- 
cations (GCIMCA) based on the Globus toolkit [5] and the SPRNG library [11], 
using the techniques described in this paper. The infrastructure software aims 
to provide grid services to facilitate the development of grid-based Monte Carlo 
applications and the execution of large-scale Monte Carlo computations in a 
grid-computing environment. At the same time, we are also trying to execute 
some real-life large-scale Monte Carlo applications, such as the Monte Carlo sim- 
ulation of Ligand-Receptor interaction in structured protein systems [17] and the 
Monte Carlo molecular modeling applications, on our developing grid-computing 
infrastructure. 
Abstract. Systems of linear algebraic equations Ax = b occur very often
when large-scale mathematical models are treated. The solution of these 
systems is as a rule the most time-consuming part of the computational 
work when large-scale mathematical models are handled on computers. 
Therefore, it is important to be able to solve such problems efficiently. 
It is assumed that the systems Ax = b, which must be solved many 
times during the treatment of the models, are (i) very large (containing 
more than $10^6$ equations) and (ii) general sparse. Moreover, it is also
assumed that parallel computers with shared memory are available. An 
efficient algorithm for the solution of such large systems under the above 
assumptions is described. Numerical examples are given to demonstrate 
the ability of the algorithm to handle very large systems of linear alge- 
braic equations. 

The algorithm can be applied in the treatment of some large-scale air 
pollution models without using splitting procedures. 



1 Motivation 

Large-scale air pollution models can successfully be used to resolve many tasks 
connected with the selection of appropriate measures that could be applied in the 
efforts to avoid possible damages caused by harmful pollutants. Some of these 
tasks are legislated in official documents of the Parliament of the European 
Union (see, for example, [3]). 

Most of the large-scale air pollution models are described by systems of par- 
tial differential equations (the number of equations being equal to the number 
of chemical species studied by the model); see [1,8,18,19]. The direct applica- 
tion of either finite elements or finite differences during the numerical treatment 
of such models leads to the solution of very large systems of linear algebraic 
equations. Assume that the numbers of grid points along the coordinate axes are
$N_x$, $N_y$ and $N_z$ respectively. Assume also that the number of chemical species
is $N_s$. Several systems containing $N = N_x \times N_y \times N_z \times N_s$ equations have to
be treated at each time-step. If $N_x = N_y = 480$, $N_z = 10$ and $N_s = 35$, then
N = 80640000. Not only can N be very large (as shown in the above example),
but also many time-steps are to be carried out. The number of time-steps is 
normally in the range from O(IO^) to 0(10®). Thus the computational tasks are
very difficult both because of the computing time requirements and because of 
the storage needed. Most of the difficulties can be overcome by applying some 
splitting procedure. The large problem is split into a sequence of small problems, 
which can be treated on the available computers. 

The splitting procedures greatly simplify the computational tasks. However,
they also cause splitting errors. It is difficult to control the size of
these errors. This is why it is desirable to avoid (if possible) the application of 
splitting procedures. This can be done only if new and very efficient methods 
for the solution of very large systems of linear algebraic equations are developed 
and used. Some preliminary results obtained in the efforts to develop such meth- 
ods will be discussed in this paper. The results are rather general and it will be 
possible to apply the new methods for the treatment of very large systems of 
linear algebraic equations also in some other fields of science and engineering. 

2 Major Assumptions 

Since the direct discretization of the systems of partial differential equations 
arising in air pollution modelling leads to the solution of large systems of linear 
algebraic equations, we shall assume that systems of the type 

$$Ax = b, \quad b \in \mathbb{R}^{N}, \quad x \in \mathbb{R}^{N}, \qquad (1)$$

are to be solved. Matrix A is (i) very large (in this paper “very large” means
that $N > 10^6$), (ii) non-symmetric and (iii) very badly scaled, because some of
the non-zero elements of A depend on concentrations of the chemical species, 
which vary in the range from 0(10“®) to 0(10®®) because these quantities are 
measured by using the number of molecules in cubic centimeter. Therefore, the 
direct use of iterative methods is not advisable. 

Matrix A is in general banded. However, when A is very large, then the 
bandwidth is also large and the use of direct methods for banded matrices is 
prohibitive (in spite of the fact that very efficient software for banded matrices is 
available; see [2]). Indeed, the bandwidth is about 168 000 when an air pollution 
model involving 35 species is discretized on a (480 x 480 x 10) grid. 
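
The sizes quoted for the (480 x 480 x 10) grid with 35 species can be checked with a short calculation. The function names are hypothetical, and the bandwidth formula is an assumption: it corresponds to an ordering of the unknowns in which neighbouring grid points along one horizontal axis are 480 x 10 x 35 positions apart, which reproduces the figure of about 168 000.

```python
def system_size(nx, ny, nz, ns):
    """Number of equations: one unknown per grid point and chemical species."""
    return nx * ny * nz * ns


def assumed_bandwidth(ny, nz, ns):
    """Distance between unknowns of neighbouring grid points along the x axis,
    assuming the x index varies slowest (an assumption of this sketch)."""
    return ny * nz * ns


print(system_size(480, 480, 10, 35))     # 80640000
print(assumed_bandwidth(480, 10, 35))    # 168000
```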

Matrix A is sparse. However, it is also difficult to exploit the sparsity di- 
rectly, because of the great number of fill-ins which will be produced during the 
factorization even if a good pivotal strategy (as, for example, some of the pivotal 
strategies described in [15] and [17]) is used in the efforts to preserve better the 
original sparsity pattern of matrix A. 

This short analysis indicates that it is worthwhile to try the following
approach. Apply some good method for factorization of sparse matrices combined
with dropping of all non-zero elements which are smaller than a given
drop-tolerance parameter multiplied by the element that is maximal in absolute
value in the row to which the candidate for dropping belongs. An approximate
LU-factorization will be calculated in this way (the method is fully described in
[7, 16, 17]). The approximate LU-factorization of matrix A can then be used as a
preconditioner for some iterative method (as, for example, some of the methods
discussed in [4,5,9-14,16,17]).
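
The dropping rule can be illustrated by a small row-wise factorization in Python. This is a simplified sketch and not the method of [7, 16, 17]: it uses no pivoting, stores each row as a dictionary, and applies the drop test once per row after the row has been eliminated. The function name `approx_lu` is hypothetical.

```python
def approx_lu(rows, drop_tol):
    """Row-wise LU without pivoting; `rows` is a list of {column: value} dicts.

    Returns (lower, upper): the multipliers of L (unit diagonal implied) and
    the rows of U. Entries smaller than drop_tol times the largest absolute
    value in their row are dropped; the diagonal entry is always kept.
    """
    lower, upper = [], []
    for i in range(len(rows)):
        row = dict(rows[i])
        for k in range(i):
            if k in row:
                factor = row[k] / upper[k][k]
                row[k] = factor                     # multiplier stored in L
                for j, u in upper[k].items():
                    if j > k:                       # update, possibly creating fill-in
                        row[j] = row.get(j, 0.0) - factor * u
        largest = max(abs(v) for v in row.values())
        row = {j: v for j, v in row.items()
               if j == i or abs(v) >= drop_tol * largest}
        lower.append({j: v for j, v in row.items() if j < i})
        upper.append({j: v for j, v in row.items() if j >= i})
    return lower, upper
```

With `drop_tol = 0` the sketch returns an exact factorization of the matrix (as long as no zero pivot occurs); larger values trade accuracy of the factors for sparsity.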

The calculation of the preconditioner remains the most time-consuming part 
in the solution of (1) also when dropping is used during the factorization in order 
to preserve better the sparsity pattern of matrix A. Therefore, it is important 
to develop a parallel algorithm for the computation of the preconditioner in 
which the ideas sketched in the previous paragraph can efficiently be applied. 
The development of such a parallel algorithm is discussed in this paper. 

3 Description of the Algorithm 

The algorithm used to obtain an approximate LU-factorization of matrix A (that
can be used as a preconditioner in the solution of the system of linear algebraic 
equations by some iterative method), consists of the following five steps. 

3.1 Step 1 Obtaining an Upper Block-Triangular Matrix

A simple reordering is performed in an attempt to push the non-zeros in the 
lower triangular part of matrix A as close to the main diagonal as possible. First 
the columns of A are ordered. Let $c_j$ be the number of non-zeros in column j.
Column interchanges are applied to obtain a permutation of A such that

$$j < k \;\Rightarrow\; c_j \le c_k . \qquad (2)$$

After that the rows are ordered in a similar way. Assume that $r_i$ is the number
of non-zeros in row i. Row interchanges are applied to obtain a permutation (of
the matrix obtained after the column interchanges) such that

$$i < m \;\Rightarrow\; r_i \ge r_m . \qquad (3)$$

The result of the column and row interchanges performed as described above
is a matrix in which many zero elements are moved to the lower left-hand
corner of the matrix and, thus, the matrix obtained after these interchanges can
be considered as an upper block-triangular matrix.

Some other reordering algorithms can also be used. For example, the
algorithm LORA from [6] can be applied. The advantage of the algorithm sketched
above is that it is very cheap; its computational cost is O(NZ), where NZ is
the number of non-zeros in A. The computational cost of LORA is O(NZ log N),
which is a considerable increase when N is very large. On the other hand, a
computationally more expensive, but also more advanced, algorithm (such as LORA)
may lead to a better reordering of the matrix and, thus, to a better performance
in the remaining steps of the algorithm.
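
The simple reordering of Step 1 can be sketched in Python as follows. The sketch assumes that columns are sorted by non-decreasing numbers of non-zeros and rows by non-increasing numbers of non-zeros, so that the sparsest rows and columns meet in the lower left-hand corner; this reading of the row ordering is an assumption, as is the function name `reorder`. For brevity the sketch uses a comparison sort, whereas a counting sort would be needed to reach the O(NZ) cost mentioned above.

```python
def reorder(n, entries):
    """Permute the non-zero pattern of an n x n matrix.

    `entries` is a list of (row, column) index pairs of the non-zeros.
    Returns the same pattern expressed in the new row and column numbering.
    """
    row_count = [0] * n
    col_count = [0] * n
    for i, j in entries:
        row_count[i] += 1
        col_count[j] += 1
    # Columns: non-decreasing number of non-zeros.
    col_order = sorted(range(n), key=lambda j: col_count[j])
    # Rows: non-increasing number of non-zeros (assumed reading of the row rule).
    row_order = sorted(range(n), key=lambda i: -row_count[i])
    new_col = {old: new for new, old in enumerate(col_order)}
    new_row = {old: new for new, old in enumerate(row_order)}
    return [(new_row[i], new_col[j]) for i, j in entries]
```

Only the index maps are built; the values of the non-zeros would be moved along with their indices in the same pass.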

3.2 Step 2 Dividing the Reordered Matrix into Block-Rows 

The matrix reordered as described in the previous sub-section can be divided
into several block-rows, the diagonal blocks of which are rectangular matrices.
Denote the reordered matrix by B (i.e. B = PAQ, where P and Q are the
permutation matrices that are induced by the row and column interchanges). It is
desirable to divide matrix B into q block-rows which can be processed in
parallel. A simple example of such a division with q = 4 is given in Fig. 1. While
the special case q = 4 is used both in Fig. 1 and in the other figures (in order
to facilitate the understanding of the ideas), all remarks given below are for the
general case (i.e. for an arbitrary q).

$$B = \begin{pmatrix}
B_{11} & B_{12} & B_{13} & B_{14} \\
0 & B_{22} & B_{23} & B_{24} \\
0 & 0 & B_{33} & B_{34} \\
0 & 0 & 0 & B_{44}
\end{pmatrix}$$

Fig. 1. Division of matrix B when q = 4.

Note that all the blocks located under the diagonal blocks $B_{ii}$, where i =
1, 2, ..., q - 1, contain only zero elements. This is essential in the attempts to
achieve parallel computations when the different block-rows are processed.

It should also be noted that the diagonal blocks are rectangular. The number
of rows is greater than or equal to the number of columns for the first q - 1
diagonal blocks. The situation is changed for the last diagonal block $B_{qq}$, i.e. the
number of rows is less than or equal to the number of columns.

The computations in each block-row are independent of the computations in
the remaining block-rows. This means that there are q parallel tasks related to
the computations in the next step (Step 3). It is important to emphasize here
that an attempt to produce block-rows with approximately equal numbers of
rows is carried out in order to achieve a better load balance. The fact that
rectangular diagonal blocks $B_{ii}$, with i = 1, 2, ..., q, are allowed facilitates this
task. However, a perfect load balance will in general not be achieved even if
the numbers of rows in all block-rows are the same (because the numbers of
non-zero elements may vary from one block-row to another). Therefore, q > p
(where p is the number of processors which are to be assigned to the job) is used
in the experiments in Section 4 in order to increase the possibility of achieving
a better load balance.



3.3 Step 3 Factorizing the Block-Rows 

Gaussian transformations are carried out in order to produce zero elements under
the main diagonal of each diagonal block $B_{ii}$, i = 1, 2, ..., q. These calculations
can be performed in parallel. The result of performing Gaussian transformations
for the matrix given in Fig. 1 is shown in Fig. 2.

While the special case q = 4 is used in Fig. 2 in order to facilitate the
understanding of the ideas, the remarks given below are for the general case (i.e. for
an arbitrary q).

It is seen (compare Fig. 2 with Fig. 1) that each of the first q - 1 block-rows
of matrix B is divided into two block-rows. Consider the diagonal block $B_{ii}$,
i = 1, 2, ..., q - 1. Assume that this diagonal block contains $m_i$ rows and $n_i$
columns. Then, as a result of the Gaussian transformations in row-block i, the
diagonal block $B_{ii}$ is split into an upper triangular matrix $C_{ii} \in \mathbb{R}^{n_i \times n_i}$ and a
zero matrix of order $(m_i - n_i) \times n_i$.

Consider block $B_{qq}$ and assume that it contains $m_q$ rows and $n_q$ columns
with $m_q \le n_q$. The result of performing Gaussian transformations in row-block
q will lead to an upper triangular matrix $C_{qq}$.
$$\begin{pmatrix}
C_{11} & C_{12} & C_{13} & C_{14} & E_{14} \\
0 & D_{12} & D_{13} & D_{14} & F_{14} \\
0 & C_{22} & C_{23} & C_{24} & E_{24} \\
0 & 0 & D_{23} & D_{24} & F_{24} \\
0 & 0 & C_{33} & C_{34} & E_{34} \\
0 & 0 & 0 & D_{34} & F_{34} \\
0 & 0 & 0 & C_{44} & E_{44}
\end{pmatrix}$$

Fig. 2. The result of the factorization of the block-rows of the matrix given in Fig. 1.



The dimensions of the remaining blocks in Fig. 2 are listed below: 

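- $C_{ij} \in \mathbb{R}^{n_i \times n_j}$ with i = 1, 2, ..., q - 1, j = 2, 3, ..., q - 1, and i < j;

- $D_{ij} \in \mathbb{R}^{(m_i - n_i) \times n_j}$ with i = 1, 2, ..., q - 1, j = 2, 3, ..., q and i < j;

- $E_{iq} \in \mathbb{R}^{n_i \times (n_q - m_q)}$ with i = 1, 2, ..., q;

- $F_{iq} \in \mathbb{R}^{(m_i - n_i) \times (n_q - m_q)}$ with i = 1, 2, ..., q - 1.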

The computations related to the blocks $C_{ij}$ (i, j = 1, 2, ..., q and i < j) as well as
the computations related to the blocks $E_{iq}$ (i = 1, 2, ..., q - 1) are finished at the
end of Step 3. The further computations are carried out in the remaining blocks.
Therefore, it is desirable that these blocks are small (the reordering procedure
from Step 1 is performed in an attempt to reduce as much as possible the size of
these blocks).

3.4 Step 4 Producing Zeros in Blocks Dij 

The computations should be continued by producing zeros in the blocks $D_{ij}$
with i = 1, 2, ..., q - 1, j = 2, 3, ..., q and i < j. It is appropriate to carry
out the calculations by “diagonals”. During the calculations along the “first
diagonal”, the pivotal elements of block $C_{22}$ are used to produce zeros in
$D_{12}$, the pivotal elements of $C_{33}$ are used to produce zeros in $D_{23}$, and so
on. The number of tasks along the “first diagonal” is q - 1 and these tasks can
be performed in parallel. When the calculations along the “first diagonal” are
finished, the calculations along the “second diagonal” can be started. The pivotal
elements of block $C_{33}$ are used to produce zeros in $D_{13}$, the pivotal elements
of $C_{44}$ are used to produce zeros in $D_{24}$, and so on. The number of parallel
tasks is now q - 2. The calculations are carried out in a similar manner along the
“remaining diagonals”. It is important to emphasize that the number of parallel
tasks is reduced by one when the calculations along the next diagonal are to be
carried out. This is not a big problem when the blocks $D_{ij}$ are small. In order
to improve the performance it is worthwhile to choose the number of block-rows
q considerably larger than the number of processors p that will be used (this is
another reason for choosing q > p; the first one was discussed at the end of §3.2).
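
The order of the calculations by “diagonals” can be made explicit with a small scheduling sketch. The function name `diagonal_schedule` and the representation of a task as a pair of block indices are hypothetical; the sketch only enumerates which pivot block is applied to which target block and does no numerical work.

```python
def diagonal_schedule(q):
    """List the Step 4 tasks diagonal by diagonal.

    Each task is a pair ((j, j), (i, j)): the pivots of block C_jj are used
    to produce zeros in block D_ij. Tasks within one diagonal are independent.
    """
    schedule = []
    for d in range(1, q):                      # d-th "diagonal": j - i == d
        tasks = [((j, j), (j - d, j)) for j in range(d + 1, q + 1)]
        schedule.append(tasks)
    return schedule


for number, tasks in enumerate(diagonal_schedule(4), start=1):
    print(number, tasks)
```

For q = 4 the three diagonals contain 3, 2 and 1 tasks respectively, which shows how the available parallelism shrinks from one diagonal to the next.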

When the calculations in Step 4 are completed, the matrix shown in Fig. 2
will be transformed so that the structure given in Fig. 3 will be obtained. As
mentioned above, the number of parallel tasks is gradually reduced when the
calculations are carried out by “diagonals”. Parallel computations can also be
achieved in the following way. Consider any block $D_{ij}$ (i = 1, 2, ..., q - 1, j =
2, 3, ..., q and i < j). The production of zeros in any row of this block (by
using the pivotal elements of block $C_{jj}$) is a parallel task. This means that the
computations in each row of block $D_{ij}$ can be carried out in parallel. The parallel
tasks are not so large as in the previous case (where zeros in whole blocks $D_{ij}$
are produced). However, the tasks are still sufficiently large when matrix A is
very large. Moreover, the possibility of achieving a better load balance is in
general increased, because the number of parallel tasks is very large.

$$\begin{pmatrix}
C_{11} & C_{12} & C_{13} & C_{14} & E_{14} \\
0 & 0 & 0 & 0 & F_{14} \\
0 & C_{22} & C_{23} & C_{24} & E_{24} \\
0 & 0 & 0 & 0 & F_{24} \\
0 & 0 & C_{33} & C_{34} & E_{34} \\
0 & 0 & 0 & 0 & F_{34} \\
0 & 0 & 0 & C_{44} & E_{44}
\end{pmatrix}$$

Fig. 3. The result of producing zeros in the blocks $D_{ij}$ of the matrix given in Fig. 2.

The second approach is used to obtain the results shown in Section 4.



3.5 Step 5 Reordering and Finishing the Computations

The matrix whose structure is shown in Fig. 3 can be reordered (by row
permutations) to the form given in Fig. 4. It should be emphasized here that it is
not necessary to perform the actual row interchanges; it is quite sufficient to
renumber the rows.

Fig. 4. The result of reordering the blocks in the matrix given in Fig. 3.

The large block formed by the blocks $C_{ij}$ and the appropriate zero blocks is
an upper triangular matrix. The order of this large block is

$$\sum_{i=1}^{q-1} n_i + m_q . \qquad (4)$$

The column-block formed by the blocks $E_{iq}$ is a rectangular matrix. It is
clear that the first dimension of this block is the same as that of the square
matrix formed by the blocks $C_{ij}$, which is given in (4). The second dimension of
this matrix is $n_q - m_q$.

Simple calculations show that the matrix formed by the blocks $F_{iq}$ is square.
Its order is $n_q - m_q$. Furthermore, this matrix is normally not very large.
Moreover, it should be expected that this matrix is relatively dense, because a
lot of computations involving the blocks $F_{iq}$ have already been carried out in the
previous steps. Therefore, it is worthwhile to switch to a dense matrix technique
in Step 5 and to apply subroutines from, for example, [2].

4 Numerical Results 

The numerical algorithm discussed in the previous sections has been tested by 
using different matrices produced by the matrix generators described in [17]. 
There are several advantages when such generators are used: (i) one can produce 
arbitrarily many matrices (while the number of test-matrices in the available 
data-bases is limited), (ii) one can vary the size of the matrices, (iii) one can 
vary the sparsity pattern, (iv) one can vary the number of non-zero elements 
and (v) one can vary the condition number of the matrices tested. 

A few experiments will be given in this section to demonstrate the perfor- 
mance of the algorithm in the following situations. 

— The behaviour of the numerical algorithm when the number q of block-rows is
varied for very large matrices is different from the corresponding behaviour
for small matrices. It should be noted that matrices of order $O(10^5)$ are
called here small. Such matrices are considered as large in many papers on
the solution of sparse systems of linear algebraic equations.

— The parallel properties of the algorithm.

The conclusions drawn in this section and in the next one are based on results
obtained in many more experiments than those used in the preparation of
Tables 1-3.
Some attempts to simulate matrices which arise after the discretization of air
pollution models were made: (i) the average number of non-zero elements per
row was no less than 10, (ii) the non-zero elements were widely spread and (iii)
the matrices chosen for experiments were badly scaled.

All experiments which are reported here were carried out on SUN computers
at DCSC (the Danish Center for Scientific Computing). Some experiments on
IBM computers at the University of Aarhus gave quite similar results.

4.1 Varying the Number of Block-Rows for Small Matrices

Systems containing $10^5$ linear algebraic equations are used in this sub-section.
The number of the non-zero elements is 1 000 110 (i.e. the average number of 
non-zero elements per row is about 10). The number q of block-rows is varied in 
the range from 8 to 512. The results are shown in Table 1 (the computing times 
are given in seconds). Computing times measured in seconds are also used in the 
remaining tables (Table 2 and Table 3). The notation that is applied in all the 
tables can be explained as follows: 

— ORD time is the sum of the time needed for the reordering of the matrix 
(described in Step 1; see Section 3) and the time needed to divide the matrix 
into block-rows (Step 2). 

— FACT time is the time needed to obtain the approximate LU-factorization, 
which is used as a preconditioner in the solution part (the preconditioner is 
calculated in Step 3 - Step 5 in Section 3). 

— SOLV time is the time needed to obtain a starting approximation for the 
iterative procedure (the back substitution step) and, after that, to carry out 
iterations until a prescribed accuracy (an accuracy requirement of 10“^ was 
actually used in all experiments) is achieved by the preconditioned iterative 
method chosen (the modified Orthomin method from [17] was actually used 
in all runs). 

— TOTAL time is the total computing time spent in the run. 

Several conclusions can be drawn by studying the results in Table 1 (similar 
conclusions can be drawn by using the results of many other runs which were 
performed). 

1. The ORD times and the SOLV times do not depend too much on the 
parameter q (on the number of block-rows). 

2. The FACT times (and, therefore, also the TOTAL times) depend on the 
choice of q. For this matrix, the best result is obtained with the choice 
q = 64. 

3. When the FACT time is the best one (i.e. when q = 64), the ORD time and 
the SOLV time are comparable with the FACT time. 



4.2 Varying the Number of Block-Rows for Large Matrices 

The number of equations in the system of linear algebraic equations is increased 
40 times, i.e. from 100000 to 4000000. The number of non-zero elements in the 
matrix is increased from 1000110 to 40000110, i.e. the average number of 
non-zero elements per row is again about 10. This very large system is solved by 
using different values of q. Results are shown in Table 2. 

Table 1. Computing times (measured in seconds) obtained in the solution of a system 
of 100000 linear algebraic equations with different values of parameter q (different 
numbers of block-rows). The number of non-zero elements in the coefficient matrix is 
1000110. These runs were performed on one processor. 

Row-blocks   ORD time   FACT time   SOLV time   TOTAL time
     8         1.29        7.44        0.83         9.56
    16         1.48        3.55        0.88         5.91
    32         1.32        1.92        0.86         4.10
    64         1.44        1.19        0.90         3.53
   128         1.50        1.39        1.00         5.89
   256         1.53        5.78        1.15         8.45
   512         1.54       38.73        2.47        42.84

Table 2. Computing times (measured in seconds) obtained in the solution of a system 
of 4000000 linear algebraic equations with different values of parameter q (different 
numbers of block-rows). The number of non-zero elements in the coefficient matrix is 
40000110. These runs were performed on one processor. 

Row-blocks   ORD time   FACT time   SOLV time   TOTAL time
    16          90        12030         63        12183
    32          88         5521         62         5671
    64          90         2773         63         2927
   128          90         1372         64         1525
   256          88          674         63          826
   512          89          472         67          628

Conclusions similar to those drawn in the previous sub-section can be drawn 
by studying the results in Table 2 (it should again be emphasized that the results 
obtained in many other runs show the same trends). 

1. It is clearly seen that both the ORD times and the SOLV times practically 
do not depend on the parameter q (on the number of block-rows). 

2. The FACT times (and, therefore, also the TOTAL times) depend on the 
choice of q. For this matrix, the best result is obtained with the choice 
q = 512. 

3. The FACT time is the largest part of the TOTAL time even with the best 
choice of q (i.e. when q = 512). 

4.3 Parallel Runs 

The system used in the previous section (i.e. the system with 4000000 equations 
and with 40000110 non-zero elements in its coefficient matrix) was also run on 
four processors. Results are shown in Table 3. Not only are the computing times 
shown in Table 3, but also the speed-up factors are given there (in brackets). 
The speed-up factors are calculated by taking the ratios of the computing times 
obtained in the runs on one processor (these times are given in Table 2) and 
the computing times obtained on four processors. 



Table 3. Computing times (measured in seconds) obtained in the solution of a system 
of 4000000 linear algebraic equations with different values of parameter q (different 
numbers of block-rows). The number of non-zero elements in the coefficient matrix is 
40000110. These runs were performed on four processors. The speed-up factors are 
given in brackets. 

Row-blocks   ORD time     FACT time      SOLV time    TOTAL time
    16       75 (1.17)    3073 (3.91)    36 (1.75)    3184 (3.83)
    32       75 (1.19)    1389 (3.97)    36 (1.72)    1500 (3.78)
    64       75 (1.20)     694 (3.99)    36 (1.75)     803 (3.65)
   128       75 (1.22)     345 (3.98)    36 (1.78)     457 (3.34)
   256       75 (1.19)     180 (3.74)    37 (1.70)     292 (2.83)
   512       75 (1.22)     206 (2.29)    38 (1.76)     330 (1.90)



The results that are given in Table 3 (as well as the results obtained in many 
other runs) indicate that the following conclusions can be drawn. 

1. The ORD times and the SOLV times remain practically the same when 
parameter q (i.e. the number of block-rows) is varied. 

2. The FACT times (and, therefore, also the TOTAL times) depend again on 
the choice of q. For this matrix, the best result is obtained with the choice 
q = 256. 

3. The ORD time and the SOLV time remain considerably smaller than the 
FACT time also when the FACT time is the smallest one (i.e. for the best 
choice of q). 

4. The speed-up factors for the ORD times remain practically the same for all 
values of q. These factors are also very small. This shows that the ordering 
of the matrix and the division into block-rows are two operations which are 
in fact run sequentially. 

5. The speed-up factors for the SOLV times also remain practically the same for 
all values of q. These factors are greater than the corresponding factors for 
the ORD times, but still not very large. This shows that the back substitution 
and the iterations during the preconditioned modified Orthomin algorithm 
do not parallelize very well. 

6. The speed-up factors for the FACT times tend to decrease when the value of 
q is increased (i.e. when the matrix is divided into more block-rows). These 
factors are greater than the corresponding factors for the ORD times and 
the SOLV times. The efficiency achieved (i.e. the speed-up factor divided 
by the number of processors and multiplied by 100) is very often over 90% 
(about 94% for the best choice q = 256). 

7. The speed-up factors for the TOTAL times also tend to decrease when the 
value of q is increased (i.e. when the matrix is divided into more block-rows). 
These factors are smaller than the corresponding factors for the FACT times, 
but considerably greater than the corresponding factors for the ORD times and 
the SOLV times. The efficiency achieved is often over 90%. However, for the 
important case where the best choice is made (q = 256) the efficiency is 
considerably smaller: about 71%. 
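
The speed-up factors and efficiencies discussed above can be recomputed directly from the tabulated times. The following is only a small sketch: the dictionary and function names are illustrative and do not come from the paper, and the numbers are copied from Table 2 and Table 3.

```python
# Computing times in seconds for (ORD, FACT, SOLV, TOTAL), keyed by the
# number of block-rows q. Values are copied from Table 2 (one processor)
# and Table 3 (four processors); all names here are illustrative.
ONE_PROC = {
    16: (90, 12030, 63, 12183),
    32: (88, 5521, 62, 5671),
    64: (90, 2773, 63, 2927),
    128: (90, 1372, 64, 1525),
    256: (88, 674, 63, 826),
    512: (89, 472, 67, 628),
}
FOUR_PROC = {
    16: (75, 3073, 36, 3184),
    32: (75, 1389, 36, 1500),
    64: (75, 694, 36, 803),
    128: (75, 345, 36, 457),
    256: (75, 180, 37, 292),
    512: (75, 206, 38, 330),
}
PARTS = ("ORD", "FACT", "SOLV", "TOTAL")


def speedup_and_efficiency(t_one, t_par, n_proc=4):
    """Return (speed-up factor, efficiency in percent)."""
    speedup = t_one / t_par
    return speedup, 100.0 * speedup / n_proc


for q in sorted(ONE_PROC):
    cells = []
    for name, t1, t4 in zip(PARTS, ONE_PROC[q], FOUR_PROC[q]):
        s, e = speedup_and_efficiency(t1, t4)
        cells.append(f"{name} {s:.2f} ({e:.0f}%)")
    print(f"q = {q:3d}: " + "  ".join(cells))
```

Because the tabulated times are rounded, a recomputed factor may differ in the last digit from the value printed in brackets in Table 3.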

5 Conclusions 

An algorithm for the solution of very large sparse systems of linear algebraic 
equations was discussed and tested. Several conclusions about the performance 
of the algorithm (including the performance on a SUN parallel computer) were 
drawn in the previous section. 

Some additional improvements are needed in order to apply this algorithm 
in the treatment of large-scale mathematical models for studying air pollution 
phenomena. It is necessary to improve the performance of the following parts of 
the algorithm: 

— the reordering part, 

— the division into block-rows, 

— the preconditioned iterative methods. 

It might be worthwhile to try to exploit some special properties of the systems 
of linear algebraic equations arising in large scale air pollution models in order 
to find a simpler and more efficient preconditioner. 

Abstract. Several methods have been suggested for obtaining the 
estimates of the solution vector for the robust linear regression model 
Ax = b + e using an iteratively reweighted criterion based on weighting 
functions, which tend to diminish the influence of outliers. We consider a 
combination of Newton method (or iteratively reweighted least squares 
method) with a Krylov subspace method. We show that we can avoid 
preconditioning with $A^T A$ preconditioner by merely transforming the 
sequence of linear systems to be solved. Appropriate sequences of linear 
systems for sparse and dense matrices are given. By employing efficient 
sparse QR factorization methods, we show that it is possible to solve 
efficiently large sparse linear systems with an unpreconditioned Krylov 
subspace method. 

1 Introduction 

Consider the standard linear regression model $y = Ax + e$, where $y \in \mathbb{R}^m$ is a 
vector of observations, $A \in \mathbb{R}^{m \times n}$ $(m > n)$ is the data or design matrix of rank 
$n$, $x \in \mathbb{R}^n$ is the vector of unknown parameters, and $e \in \mathbb{R}^m$ is the unknown 
vector of measurement errors. Let the residual vector $r$ be given by 

$$r(x) = y - Ax, \quad \text{where } r_j(x) = y_j - A_{j\cdot}x, \quad j = 1, \ldots, m, \tag{1}$$

where $A_{j\cdot}$ denotes the $j$-th row of $A$. In this paper we need to solve 

$$\min_x f(x) = \min_x \sum_{j=1}^{m} \rho(r_j(x)/\sigma), \tag{2}$$

where $\rho$ is a given weighting function, and $\sigma$ is the scale factor (connected to 
the data). The functions $\rho(\cdot)$ we are interested in are those that are less 
sensitive to large absolute residuals. The functions $\rho(\cdot)$ we will consider are twice 
continuously differentiable almost everywhere with $\rho''(\cdot) \ge 0$. 

Baryamureeba [1] recently suggested the Barya function 

$$\rho(z) = \begin{cases} z^2/2 & \text{if } |z| \le \beta \\ \tau z^2/2 + (1-\tau)\beta^2/2 & \text{if } |z| > \beta \end{cases} \tag{3}$$



I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 67-75, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 

where $\beta$ and $\tau$ $(0 < \tau < 1)$ are problem-dependent parameters. Huber 
suggested the Huber function 

$$\rho(z) = \begin{cases} z^2/2 & \text{if } |z| \le \beta \\ \beta|z| - \beta^2/2 & \text{if } |z| > \beta \end{cases}$$

where $\beta$ is a problem-dependent parameter. The Logistic function is 
$\rho(z) = \beta^2 \log(\cosh(z/\beta))$. Fair [6] proposed the Fair function 
$\rho(z) = \beta^2 \left( |z|/\beta - \log(1 + |z|/\beta) \right)$. Huber [8] also proposed the function 

$$\rho(z) = \begin{cases} z^2/2 & \text{if } |z| \le \beta \\ \beta^2/2 & \text{if } |z| > \beta \end{cases}$$

which is given the name Talwar [7]. The Barya function is a linear combination 
of the Talwar and LSQ weighting functions. In particular, let $\rho_1(z)$, $\rho_2(z)$ and 
$\rho_3(z)$ denote the Barya, LSQ and Talwar functions respectively. Then 
$\rho_1(z) = \tau\rho_2(z) + (1 - \tau)\rho_3(z)$. The parameter $\tau$ must be chosen to satisfy 
two conflicting goals: (i) to guarantee a positive definite weight matrix and 
(ii) to minimize the influence of large absolute residuals. Let 
$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup \mathcal{D}_3$, where $\mathcal{D}_1 = [-\infty, -\beta)$, $\mathcal{D}_2 = [-\beta, \beta]$ and 
$\mathcal{D}_3 = (\beta, \infty]$ are the piece-wise sub-domains over which the Barya function is 
defined. It is clear that the Barya function is strictly convex in any sub-domain 
$\mathcal{D}_i$, $i = 1, 2, 3$. It is also continuous over the entire domain $\mathcal{D}$. Thus we refer to 
the Barya function as a piece-wise convex function. 
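
To make the relation between these weighting functions concrete, the sketch below evaluates the LSQ, Talwar, Huber and Barya functions with NumPy and forms the Barya function as the stated combination $\tau\rho_2 + (1-\tau)\rho_3$. The function names and the sample values of $\beta$ and $\tau$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def lsq(z):
    """Least squares function rho(z) = z^2 / 2."""
    return 0.5 * np.asarray(z, dtype=float) ** 2


def talwar(z, beta):
    """Talwar function: z^2/2 inside [-beta, beta], constant beta^2/2 outside."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= beta, 0.5 * z**2, 0.5 * beta**2)


def huber(z, beta):
    """Huber function: z^2/2 inside [-beta, beta], linear growth outside."""
    z = np.asarray(z, dtype=float)
    a = np.abs(z)
    return np.where(a <= beta, 0.5 * z**2, beta * a - 0.5 * beta**2)


def barya(z, beta, tau):
    """Barya function as tau * LSQ + (1 - tau) * Talwar."""
    return tau * lsq(z) + (1.0 - tau) * talwar(z, beta)


# Illustrative parameter values (assumptions made for this example only).
beta, tau = 1.5, 0.25
z = np.linspace(-5.0, 5.0, 11)
print(np.column_stack([z, lsq(z), huber(z, beta), talwar(z, beta), barya(z, beta, tau)]))
```

For $|z| \le \beta$ all four functions coincide with $z^2/2$; they differ only in how strongly they grow for large absolute residuals.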



1.1 Problem Formulation 



For a known $\sigma$ the first order conditions for (2) are: 

$$F(x) = \nabla f(x) = -\frac{1}{\sigma} \sum_{j=1}^{m} A_{j\cdot}^T \, \rho'(r_j(x)/\sigma) = -\frac{1}{\sigma} A^T v = 0, \tag{4}$$

where $v$ is an $m$-vector with elements $\rho'(r_j(x)/\sigma)$. 

Newton's Method Approach. One of the methods for solving the nonlinear 
problem (4) is Newton's method. Given an initial guess $x^0$ we compute a 
sequence of search directions $\Delta x^k$ and iterates $x^k$ as follows: 

$$A^T G^k A \Delta x^k = \sigma A^T v^k, \tag{5}$$

$x^{k+1} = x^k + \alpha \Delta x^k$, where $G^k$ is a diagonal matrix with diagonal elements 
$G^k_{jj} = \rho''(r_j(x^k)/\sigma)$ and $\alpha \in (0, 1]$ is a step-length parameter. We remark that 
when $G^k$ is positive definite, $A^T G^k A \Delta x^k = \sigma A^T v^k$ can be formulated as a 
weighted least squares problem $\min_{\Delta x^k} \| (G^k)^{1/2} (A \Delta x^k - \sigma (G^k)^{-1} v^k) \|_2$. 

Iteratively Reweighted Least Square (IRLS) Method. The IRLS linear 
system corresponding to (4) differs from the Newton equation (5) in the weight 
matrix $G$. For IRLS method $G^k_{jj} = \rho'(r_j(x^k)/\sigma) / (r_j(x^k)/\sigma)$. 
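
As an illustration of the IRLS iteration, here is a minimal NumPy sketch for the Huber function with a known scale $\sigma$ and step length $\alpha = 1$. It is a sketch under stated assumptions, not the authors' implementation: the function names, the default $\beta = 1.345$, the stopping rule and the synthetic data are all choices made for this example, and each weighted system is solved with a dense least squares solver rather than a Krylov method.

```python
import numpy as np


def huber_irls_weights(z, beta):
    """IRLS weights rho'(z)/z for the Huber function: 1 if |z| <= beta, beta/|z| otherwise."""
    a = np.abs(z)
    return np.where(a <= beta, 1.0, beta / np.maximum(a, beta))


def irls_huber(A, y, beta=1.345, sigma=1.0, max_iter=100, tol=1e-10):
    """Illustrative IRLS loop; beta, sigma and the stopping rule are assumptions."""
    x = np.linalg.lstsq(A, y, rcond=None)[0]  # ordinary least squares start
    for _ in range(max_iter):
        r = y - A @ x
        g = huber_irls_weights(r / sigma, beta)
        # With G_jj = rho'(z_j)/z_j we have sigma * v = G r, so the system
        # A^T G A dx = sigma A^T v is the weighted least squares problem
        # min || G^{1/2} (A dx - r) ||_2.
        w = np.sqrt(g)
        dx = np.linalg.lstsq(w[:, None] * A, w * r, rcond=None)[0]
        x = x + dx  # step length alpha = 1
        if np.linalg.norm(dx) <= tol * (1.0 + np.linalg.norm(x)):
            break
    return x


# Synthetic test problem with a few gross outliers (illustrative data only).
rng = np.random.default_rng(0)
x_true = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((200, 3))
y = A @ x_true + 0.01 * rng.standard_normal(200)
y[:10] += 50.0

x_ls = np.linalg.lstsq(A, y, rcond=None)[0]
x_irls = irls_huber(A, y)
print("LS error:  ", np.linalg.norm(x_ls - x_true))
print("IRLS error:", np.linalg.norm(x_irls - x_true))
```

The outliers dominate the ordinary least squares estimate, while the reweighting bounds their influence, so the IRLS estimate lies much closer to the true parameters in this example.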



Lp Approach. On the other hand, Scales et al. [11] suggest another formulation 
of IRLS algorithm based on the optimization problem 
$\min_x \sum_{j=1}^{m} |y_j - A_{j\cdot}x|^p$, $1 \le p$. By setting the $x$ derivative of this problem to 
zero, we get the generalized normal equation 

$$A^T G A x = A^T G y \tag{6}$$

for $r_j(x) \ne 0$; $r_j(x)$ is defined in (1) and $G$ is the diagonal matrix defined by 

$$G_{jj} = \begin{cases} |r_j(x^k)|^{p-2} & \text{if } |r_j(x^k)| > \eta \\ \eta^{p-2} & \text{if } |r_j(x^k)| \le \eta \end{cases} \tag{7}$$

where we choose $0 < \eta < 1$ sufficiently small. 

Active Set Method. Let $z \in \mathbb{R}^m$ be defined by 

$$z_j = r_j(x)/\sigma, \quad j = 1, \ldots, m, \tag{8}$$

where $r_j(x)$ is given by (1). For the Talwar function, the influence of large 
absolute residuals is reduced drastically from both the right hand side and the 
coefficient matrix since $\rho'(z_j) = 0$ and $\rho''(z_j) = 0$ for all $|z_j| > \beta$. Observe 
that IRLS method and Newton method are equivalent for this function since 
$\rho''(z_j) = \rho'(z_j)/z_j$ for all $j$. Hence, Newton's method based on the Talwar 
function does not require a line-search when a direct method is used. Do we need 
a line-search when we carry out inexact computation? Interestingly, it has been 
shown [1] that the Talwar approach is an active set method (cf. simplex method) 
where the elements corresponding to small absolute residuals form the basic 
variables and those corresponding to large absolute residuals form the nonbasic 
variables. 

2 IRLS Method Based Comparative Study 

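The Barya and Huber Functions. Recall that IRLS method based on the 
Huber function is globally convergent [9] and no line-search is necessary with 
exact computation (direct method). For the Barya function, IRLS method is 
equivalent to Newton's method. Here we compare the Barya and Huber 
functions at a given iteration of IRLS method. The purpose of this comparison is to 
determine the choice of the parameter $\tau$ in (3). Let $z$ defined by (8) be generated 
by IRLS method based on the Huber function at a given iteration. Let $u, v \in \mathbb{R}^m$ 
be such that 

$$v_j = \begin{cases} z_j & \text{if } |z_j| \le \beta \\ \beta\,\mathrm{sign}(z_j) & \text{if } |z_j| > \beta \end{cases} \tag{9}$$

and 

$$u_j = \begin{cases} z_j & \text{if } |z_j| \le \beta \\ \tau z_j = \tau \frac{|z_j|}{\beta} \beta\,\mathrm{sign}(z_j) = \tau \frac{|z_j|}{\beta} v_j & \text{if } |z_j| > \beta \end{cases} \tag{10}$$

Let $K \in \mathbb{R}^{m \times m}$ be a diagonal matrix with diagonal elements 

$$K_{jj} = \begin{cases} 1 & \text{if } |z_j| \le \beta \\ \tau |z_j| / \beta & \text{if } |z_j| > \beta \end{cases}$$

Then $u = Kv$. Let $G, H \in \mathbb{R}^{m \times m}$ be diagonal matrices with diagonal elements 

$$G_{jj} = \begin{cases} 1 & \text{if } |z_j| \le \beta \\ \beta / |z_j| & \text{if } |z_j| > \beta \end{cases} \qquad H_{jj} = \begin{cases} 1 & \text{if } |z_j| \le \beta \\ \tau & \text{if } |z_j| > \beta \end{cases}$$

Then $H = KG$. Thus we have: 

$$\text{I}: \quad A^T G A \Delta x = \sigma A^T v \tag{11}$$

$$\text{II}: \quad A^T H A \Delta x = \sigma A^T u \;\Longrightarrow\; A^T K G A \Delta x = \sigma A^T K v. \tag{12}$$

Let $h = \sigma G^{-1} v$. Then the weighted problems corresponding to I and II are: 

$$\text{I}: \quad \min_{\Delta x} \| G^{1/2} (A \Delta x - h) \|_2 \tag{13}$$

$$\text{II}: \quad \min_{\Delta x} \| K^{1/2} G^{1/2} (A \Delta x - h) \|_2 = \min_{\Delta x} \| G^{1/2} K^{1/2} (A \Delta x - h) \|_2 \tag{14}$$

since $G$ and $K$ are positive definite diagonal matrices. To be able to reduce the 
influence of large absolute residuals it is clear from comparing (9) and (10), (11) 
and (12) that $K_{jj} \le 1$ for all $j$. Thus we suggest to choose $\tau$ in (3) by 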

P 

maxj I Zj I 

A natural choice of p is 1. In this paper we set p = 1. It is clear from (13) and 
(14) that the IRLS method based on the Barya function reduces the influence of 
large absolute residuals more than IRLS method based on the Huber function. 
Since IRLS method and Newton method are equivalent for the Barya function, 
IRLS method based on the Barya function exhibits quadratic convergence [5] 
near the solution. This is supported by the numerical tests in [3]. 

Recall that the Barya function is a linear combination of the Talwar and 
least squares functions. The parameter p must be kept small (usually 1 < p < 2) 
because of the need to minimize the bound on the spectral condition number of 
the (preconditioned) coefficient matrix and also to keep iterates x^ away from 
the boundary of the feasible region. 

A line-search procedure is not necessary for IRLS method (or Newton's 
method) based on the Barya function. For inexact Newton method based on the 
Barya function, if the search directions are computed to low accuracy then we 
need to do a line-search so that $f(x)$ is decreased substantially. Since the Barya 
function is a piece-wise convex function the line-search based on decreasing $f(x)$ 
in (2) will always lead to a local minimum in the desired region $[-\beta, \beta]$. 

Block Form of the Barya Function. Matrix $K$ in (14) is a scale matrix. The 
IRLS method based on the Barya function may be interpreted as applying IRLS 
method based on the Huber function to the scaled problem $K^{1/2}(A \Delta x - h)$, 
where the scale matrix is carefully chosen to minimize the influence of large 
absolute residuals. Results (12) and (14) show that it is possible for weight 
matrix $H$ (corresponding to the Barya function) to have at least two distinct 
diagonal elements. We only need to ensure that $K_{jj} \le 1$ for all $j$. Let $p \ge 1$, 
$z \in \mathbb{R}^m$ be given by (8) and $0 < \beta_1 < \beta_2 < \cdots < \beta_{s-1} < \beta_s < \|z\|_\infty$. Define 
Ti = for 1 < i < s and Ts = ^ . Then we define the block form 

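of the Barya weighting function by 

$$\rho(z_j) = \begin{cases} z_j^2/2 & \text{if } |z_j| \le \beta_1 \\ \tau_i z_j^2/2 + (1 - \tau_i)\beta_i^2/2 & \text{if } \beta_i < |z_j| \le \beta_{i+1} \text{ for } 1 \le i < s \\ \tau_s z_j^2/2 + (1 - \tau_s)\beta_s^2/2 & \text{if } |z_j| > \beta_s \end{cases} \tag{15}$$

In practice we require the weight matrix $H$ in (12) to have few distinct 
diagonal elements. Thus we choose $s$ in (15) to be small. On the other hand it 
is easy to show that 

$$\rho'(z_j) = \begin{cases} z_j & \text{if } |z_j| \le \beta_1 \\ \tau_i z_j & \text{if } |z_j| > \beta_i \text{ for } 1 \le i \le s \end{cases}$$

Since 

$$\rho''(z_j) = \begin{cases} 1 & \text{if } |z_j| \le \beta_1 \\ \tau_i & \text{if } |z_j| > \beta_i \text{ for } 1 \le i \le s \end{cases} \qquad \frac{\rho'(z_j)}{z_j} = \begin{cases} 1 & \text{if } |z_j| \le \beta_1 \\ \tau_i & \text{if } |z_j| > \beta_i \text{ for } 1 \le i \le s \end{cases}$$

we can see that $\rho''(z_j) = \rho'(z_j)/z_j$. Thus Newton's method and IRLS method are 
equivalent for this block form of the Barya function. 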

3 Solving the Sequence of Linear Systems 

Any robust linear regression problem can be posed as a sequence of weighted 
linear systems and solved by either a direct method or iterative method or both. 

3.1 Iterative Method Approach 

Lemma 1. Let $1 > \tau_1 > \cdots > \tau_i > \tau_{i+1} > \cdots > \tau_s > 0$. Let $G \in \mathbb{R}^{m \times m}$ be 
a diagonal matrix with each diagonal element equal to 1 or $\tau_i$ for $i = 1, \ldots, s$. 
Then $\tau_s \le \lambda_i((A^T A)^{-1} A^T G A) \le 1$. For the proof see [1]. 

Unpreconditioned Linear System. Consider equation (5). Let 

$$A = QR, \tag{16}$$

where $Q \in \mathbb{R}^{m \times n}$ is such that $Q^T Q = I$ ($Q$ consists of $n$ orthogonal columns) and 
$R \in \mathbb{R}^{n \times n}$ is a nonsingular upper triangular matrix. Given (16), (5) simplifies to 

$$Q^T G Q R \Delta x = \sigma Q^T v. \tag{17}$$

Let $\Delta x = R^{-1} \Delta z$. Instead of solving (17) we solve the symmetric system 

$$Q^T G Q \Delta z = \sigma Q^T v, \tag{18}$$

and then compute $\Delta x$ by $\Delta x = R^{-1} \Delta z$. 

Preconditioning with $A^T A$ Matrix. Here we consider the $A^T A$ preconditioner 
[4]. Considering system (5), the preconditioned linear system can be written as 

$$(A^T A)^{-1} A^T G A \Delta x = \sigma (A^T A)^{-1} A^T v. \tag{19}$$

Next we consider the preconditioned matrix $(A^T A)^{-1} A^T G A$. Given the QR 
factorization (16) 

$$(A^T A)^{-1} A^T G A = R^{-1} Q^T G Q R. \tag{20}$$

The matrices $R^{-1} Q^T G Q R$ and $Q^T G Q$ are related by a similarity transformation 
and thus $\lambda_i((A^T A)^{-1} A^T G A) = \lambda_i(R^{-1} Q^T G Q R) = \lambda_i(Q^T G Q)$, $i = 1, \ldots, n$. 

We note that if $G$ is a positive definite diagonal matrix then $\kappa(Q^T G Q) = 
\kappa((A^T A)^{-1} A^T G A) \le \kappa(G) = \max_j\{G_{jj}\} / \min_j\{G_{jj}\}$, see [1]. Thus from a 
theoretical point of view it is evident that preconditioning with $A^T A$ is unnecessary. 
In other words, instead of solving (5) with $A^T A$ preconditioner one should solve 
(18) with an unpreconditioned iterative method, for instance the unpreconditioned 
conjugate gradient method. The major drawback of solving (18) is that for a 
sparse matrix $A$, $Q$ is often nearly full (dense). The general results on the upper 
bound of the spectral condition number of the preconditioned matrix $(A^T A)^{-1} A^T G A$ 
are given in [3] and [4]. Similar results for $\kappa(Q^T G Q)$ can be derived by simply 
observing that $Q^T G Q = (Q^T Q)^{-1} Q^T G Q$ or $\kappa(Q^T G Q) = \kappa((A^T A)^{-1} A^T G A)$. 
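
The equivalence between the preconditioned system (19) and the transformed system (18) can be checked numerically on a small dense example. The sketch below uses NumPy's reduced QR factorization; the matrix sizes, the random data, the weight value and all variable names are illustrative assumptions, and the right hand side vector is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, tau = 30, 5, 0.2                        # illustrative sizes and weight
A = rng.standard_normal((m, n))
g = np.where(rng.random(m) < 0.3, tau, 1.0)   # diagonal of G: entries 1 or tau
v = rng.standard_normal(m)                    # arbitrary right hand side vector
sigma = 1.0

Q, R = np.linalg.qr(A)                        # reduced QR: Q is m-by-n, R is n-by-n
QtGQ = Q.T @ (g[:, None] * Q)
AtGA = A.T @ (g[:, None] * A)

# Spectrum of the preconditioned matrix (A^T A)^{-1} A^T G A versus Q^T G Q.
lam_prec = np.sort(np.linalg.eigvals(np.linalg.solve(A.T @ A, AtGA)).real)
lam_q = np.linalg.eigvalsh(QtGQ)              # returned in ascending order

# Solve (18) for dz, recover dx = R^{-1} dz, and compare with solving (5) directly.
dz = np.linalg.solve(QtGQ, sigma * (Q.T @ v))
dx = np.linalg.solve(R, dz)
dx_direct = np.linalg.solve(AtGA, sigma * (A.T @ v))

print("eigenvalues agree:", np.allclose(lam_prec, lam_q))
print("solutions agree:  ", np.allclose(dx, dx_direct))
print("spectrum range:   ", lam_q.min(), lam_q.max())
```

In this example the eigenvalues also stay inside $[\tau, 1]$, in line with Lemma 1 for a weight matrix whose diagonal entries are 1 or $\tau$.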

Least Squares Approach. When $G$ is positive definite then (18) corresponds 
to the weighted least squares problem 

$$\min_{\Delta z} \| G^{1/2} (Q \Delta z - \sigma G^{-1} v) \|_2. \tag{21}$$

A reliable and generally available solver like LSQR can be applied to (21). 
Linear systems arising from Newton method based on the Barya, Fair, and 
Logistic functions can be formulated in the form (21). Recall that for the Talwar 
function $\rho''(z_j) = 0$ and $\rho'(z_j) = 0$ for all $|z_j| > \beta$. Thus, for the Talwar function, 
(18) can be formulated as a least squares problem of the form (21) if and only if 
$A^T G A$ is positive definite, whereas it is not possible to formulate (5) in the form 
(21) for the Huber function since $\rho''(z_j) = 0$ and $\rho'(z_j) = \beta\,\mathrm{sign}(z_j)$ for all $|z_j| > \beta$ [1]. 

Iteratively Reweighted Least Squares (IRLS) Method. Since $G$ is positive 
definite for IRLS method, (5) and (6) can be formulated in the form (21). 

Related Work. Let $G$ be a diagonal weight matrix and $A$, $Q$ be given by (16). 
Scales et al. [11] used Conjugate Gradient (CG) method to solve linear systems 
of the form 

$$A^T G A y = h. \tag{22}$$

O'Leary [10] used CG method to solve linear systems of the form 

$$Q^T G Q y = h. \tag{23}$$

In [3] and [4] preconditioned CG method was used to solve linear systems of the 
form (22) with $A^T A$ as preconditioner. In [4] and [10], a line-search procedure 
was used to determine the step-length when Newton's method based on the 
Talwar function was used. The following observations are now in order: 

1. For a dense data matrix $A$, if unpreconditioned CG method is used, it should 
be used on system (23) instead of system (22). When the weight matrix $G$ 
is positive definite, the alternative approach is to solve the corresponding 
weighted least squares problem. Furthermore, for a dense data matrix $A$, 
instead of solving linear systems of the form (22) with $A^T A$ as preconditioner, 
we suggest solving linear systems of the form (23) with unpreconditioned CG 
method. 

2. For sparse data matrix $A$, we note that the factor $Q$ may not be as sparse 
as $A$. Thus there is need to compute the density $S_d$. For an $m$-by-$n$ matrix 
$X$ we define the density 

$$S_d = \mathrm{nnz}(X)/(mn). \tag{24}$$

At the beginning of the iteration process we need to compute the densities 
of $A$, $Q$ and $R$ using (24). We then estimate the cost per iteration for solving 
(23) with unpreconditioned CG method and the cost for solving (22) with 
$R^T R$ preconditioner. We use these relative costs when determining the linear 
system to be solved. 

Alternatively, we can compute the cost of one iteration for the preconditioned 
system (22) and for (23) with unpreconditioned iterative method and then 
continue with the linear system that is less costly. We may add that with 
emerging efficient sparse QR factorization methods, the choice between 
solving (23) with unpreconditioned CG method and (22) with preconditioned CG 
method depends heavily on numerical results rather than theoretical results. 

3. Newton's method based on the Talwar function should be used without 
line-search and the solution to the linear system should be computed to high 
accuracy. 
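
The density (24) used in observation 2 is straightforward to evaluate. The sketch below is a hedged illustration only: the helper name `density`, its tolerance parameter, and the test matrix are assumptions, and NumPy's dense QR factorization stands in for a sparse QR routine, so the densities it reports for Q and R are not representative of a fill-reducing sparse factorization.

```python
import numpy as np


def density(X, tol=0.0):
    """Density nnz(X) / (m n); entries with |x| <= tol are counted as zero."""
    m, n = X.shape
    return np.count_nonzero(np.abs(X) > tol) / (m * n)


# Illustrative sparse random matrix with roughly 5% non-zero entries.
rng = np.random.default_rng(1)
m, n = 200, 40
A = np.where(rng.random((m, n)) < 0.05, rng.standard_normal((m, n)), 0.0)

# Dense QR used here only as a stand-in for a sparse QR factorization.
Q, R = np.linalg.qr(A)

print("density of A:", density(A))
print("density of Q:", density(Q, tol=1e-12))
print("density of R:", density(R, tol=1e-12))
```

Comparing these three numbers before the iteration starts is the kind of inexpensive check that observation 2 proposes for deciding between systems (22) and (23).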

We remark that for a positive definite weight matrix G solving the least squares 
problem corresponding to (23) with for example LSQR may be worthwhile. 

3.2 Mixed Method Approach 

We consider solving the sequence of linear systems by combining a direct method 
and an iterative method in some manner, see [2]. The idea is to utilize the 
factorization from the previous iteration in constructing the preconditioner for 
the current linear system. There are situations when preconditioning is necessary, 
for instance when $\max_j\{G_{jj}\} / \min_j\{G_{jj}\}$ is very large for $G$ positive definite or 
when $\|A\|_2$ is very large for $G$ positive semi-definite. For a discussion on low-rank 
correction preconditioners see [1]. 

Table 1. Information on data matrix A. 

Problem                          S_d
name        m      n        A         Q         R       k(A^T A)
001R      1350    200    0.0217    0.5150    0.1555    2.622 x 10^?
002R      2750    500    0.0079    0.5033    0.0373    2.779 x 10^?
003R      1921    950    0.0063    0.0020    0.0081    5.521 x 10^?
004R      3045   1500    0.0015    0.1207    0.0301    2.945 x 10^?



4 Numerical Comparison 

The data matrices are sparse random matrices as described in Table 1. We use 
the MATLAB sparse qr function to compute the QR factorization $A = QR$. We 
observe that $A^T A = R^T R$, where $R \in \mathbb{R}^{n \times n}$ is a nonsingular upper triangular 
matrix. Table 1 shows that for $A$ sparse, $Q$ may be either sparse or dense. Since 
the problems we are interested in are large and sparse, iterative methods may 
be efficient if we solve the sparse linear system. Thus, it is necessary that we 
compute the densities for $A$ and $Q$ at the beginning of the iterative process. 
Otherwise, we need to compute the cost of one iteration for the linear system in 
$A$ and also for the linear system in $Q$ and then compare the costs. 

5 Concluding Remarks 

(Un)preconditioned Krylov subspace methods have been combined with Inexact 
Newton Method to solve robust linear regression problems. This paper sheds 
more light on when a line-search with Inexact Newton Method is (not) necessary. 
We have suggested new approaches that can be used to efficiently solve robust 
linear regression problems. 

Abstract. The studied large-scale linear problems arise from Crouzeix-Raviart 
non-conforming FEM approximation of second order elliptic 
boundary value problems. A two-level preconditioner for the case of 
coefficient anisotropy is analyzed. Special attention is given to the potential 
of the method for a parallel implementation. 

1 Introduction 

We consider the elliptic boundary value problem 

$$\begin{aligned} Lu \equiv -\nabla \cdot (a(x) \nabla u(x)) &= f(x) && \text{in } \Omega, \\ u &= 0 && \text{on } \Gamma_D, \\ (a(x) \nabla u(x)) \cdot n &= 0 && \text{on } \Gamma_N, \end{aligned} \tag{1}$$

where $\Omega$ is a convex polygonal domain in $\mathbb{R}^2$, $f(x)$ is a given function in $L^2(\Omega)$, 
$a(x) = [a_{ij}(x)]_{i,j=1}^{2}$ is a symmetric and uniformly positive definite matrix in $\Omega$, 
$n$ is the outward unit vector normal to the boundary $\Gamma = \partial\Omega$, and $\Gamma = \Gamma_D \cup \Gamma_N$. 
We assume that the entries $a_{ij}(x)$ are piece-wise smooth functions on $\Omega$. In the 
paper we use the terminology of the flow in porous media and we refer to $u$ as 
a pressure and $-a(x) \nabla u(x)$ as a velocity vector. 

The problem (1) can be discretized by the finite volume method, the Galerkin 
finite element method (FEM), conforming or non-conforming, or the mixed FEM. 
Each of these methods has its advantages and disadvantages when the problem 
(1) is used in a particular application. In [2] Arnold and Brezzi have 
demonstrated that after the elimination of the velocity and pressure unknowns the 
obtained Schur system for the Lagrange multipliers is equivalent to a discretization 
of (1) by Galerkin method using linear non-conforming finite elements. Namely, 
in [2] it is shown that the lowest-order Raviart-Thomas mixed finite element 
approximations are equivalent to the usual Crouzeix-Raviart non-conforming 
linear finite element approximations when the non-conforming space is augmented 
with cubic bubbles. Further, such a relationship between the mixed and 
non-conforming finite element methods has been studied and simplified for various 
finite element spaces (see, e.g. [1,5]). The work in this direction resulted also 
in construction of efficient iterative methods for solving mixed FE systems (see, 
e.g., [6-8]). Furthermore, the stiffness matrix has a regular sparsity structure 
such that in each row the number of non-zero entries is at most five. 

This paper provides a continuation of the study recently published in [12]. 
Here, the sparsity pattern of the approximation algorithm is changed to 
handle the case of coefficient anisotropy. The study is also related to the parallel 
techniques from [4, 10], where MIC(0) preconditioning of linear conforming and 
rotated bilinear non-conforming finite elements is considered. 

The rest of the paper is organized as follows. In Sections 2 and 3 we introduce 
the finite element approximation and the two-level algorithm. In Section 4 we 
propose a locally optimized anisotropic sparse approximation $B$ of the Schur 
complement $S$. The condition number $\kappa(B^{-1}S)$ is uniformly estimated. Next, 
in Section 5 we derive an expression for the execution time on a multiprocessor 
computer system which shows a good parallel efficiency for large scale problems. 
Numerical tests of the PCG convergence rate are presented in Section 6 where 
the system size and the anisotropy ratio are varied. At the end, short concluding 
remarks are given. 

2 Finite Element Discretization 

The domain $\Omega$ is discretized using triangular elements. The partition is denoted 
by $\mathcal{T}_h$ and is assumed to be quasi-uniform with a characteristic mesh-size $h$. It 
is aligned with the discontinuities of the coefficient matrix, so that over each 
element $e \in \mathcal{T}_h$ the function $a(x)$ is smooth. Further, we assume that $\mathcal{T}_h$ is 
generated by first partitioning $\Omega$ into quadrilaterals $Q$ and then splitting each 
quadrilateral into two triangles by one of its diagonals, see Figure 1. To simplify 
our considerations we assume that the splitting into quadrilaterals is 
topologically equivalent to a square mesh. We approximate the weak formulation of (1) 
using Crouzeix-Raviart non-conforming linear triangular finite elements. The 
FEM space $V_h$ consists of piece-wise linear functions over $\mathcal{T}_h$ determined by 
their values in the midpoints of the edges. The nodal basis functions of $V_h$ have 
a support on not more than two neighboring triangles where the corresponding 
node is the midpoint of their common side. Then the finite element formulation 
is: find $u_h \in V_h$, satisfying 

$$A_h(u_h, v_h) = (f, v_h) \quad \forall v_h \in V_h, \quad \text{where} \quad A_h(u_h, v_h) = \sum_{e \in \mathcal{T}_h} \int_e a(e) \nabla u_h \cdot \nabla v_h \, dx. \tag{2}$$

Here $a(e)$ is defined as the integral averaged value of $a(x)$ over each $e \in \mathcal{T}_h$. Now, 
the standard computational procedure leads to the linear system of equations 

$$A u = f, \tag{3}$$

where $A$ is the corresponding global stiffness matrix and $u \in \mathbb{R}^N$ is the vector of 
the unknown nodal values of $u_h$. The matrix $A$ is sparse, symmetric and positive 
definite. For large scale problems, the preconditioned conjugate gradient (PCG) 
method is known to be the best solution method. The goal of this study is to 
present a robust and parallelizable preconditioner for the system (3). 



3 The Two-Level Algorithm 



Since the triangulation is obtained by diagonal-wise splitting of each cell 
$Q$ into two triangles, see Figure 1-a, we can partition the grid nodes into two 
groups. The first group contains the centers of the quadrilateral super-elements 
$Q$ and the second group contains the rest of the nodes. With respect to 
this splitting, $A$ admits the following two-by-two block partitioning that can be 
written also in a block-factored form 

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & 0 \\ A_{21} & S \end{bmatrix} \begin{bmatrix} I & A_{11}^{-1} A_{12} \\ 0 & I \end{bmatrix}, \tag{4}$$

where $S$ stands for the related Schur complement. Obviously, $A_{11}$ is a diagonal 
matrix so that the Schur complement $S$ can be assembled from the corresponding 
super-element Schur complements $S_Q = A_{22;Q} - A_{21;Q} A_{11;Q}^{-1} A_{12;Q}$, i.e. 

$$S = \sum_{Q \in \mathcal{T}_h} L_Q^T S_Q L_Q,$$

where $L_Q$ stands for the restriction mapping of the global vector of unknowns 
to the local one corresponding to a macroelement $Q$ containing two triangles. 
Such a procedure is called static condensation. We now introduce the local Schur 
complement $S_Q$ and its approximation $B_Q$: 


$$S_Q = \begin{bmatrix} s_{11} & s_{12} & s_{13} & s_{14} \\ s_{21} & s_{22} & s_{23} & s_{24} \\ s_{31} & s_{32} & s_{33} & s_{34} \\ s_{41} & s_{42} & s_{43} & s_{44} \end{bmatrix}, \qquad B_Q = \begin{bmatrix} b_{11} & -b_{12} & -b_{13} & 0 \\ -b_{21} & b_{22} & -b_{23} & -b_{24} \\ -b_{31} & -b_{32} & b_{33} & -b_{34} \\ 0 & -b_{42} & -b_{43} & b_{44} \end{bmatrix}. \tag{5}$$

Here $b_{ii} > 0$ and $b_{ij} = b_{ji} \ge 0$. In a very general setting, $S$ and $B$ are 
spectrally equivalent if $S_Q$ and $B_Q$ have equal rowsums. The case $b_{23} = b_{32} = 0$ 
corresponds to the isotropic approximation (see Figure 1-c). A particular 
case of this kind was proposed and analyzed in [12]. The dashed lines represent the 
connectivity pattern of the Schur complement $S_Q$ (see Figure 1-b) and of its locally 
modified sparse approximations (see Figure 1-c,d). Assembling the 
local matrices $B_Q$ we get 

$$B = \sum_{Q \in \mathcal{T}_h} L_Q^T B_Q L_Q. \tag{6}$$

After the static condensation step, the initial problem (3) is reduced to the 
solution of a system with the Schur complement matrix $S$. At this point we apply 
the PCG method with a preconditioner $C$ defined as a MIC(0) factorization (see, 
e.g., [3, 9]) of $B$, that is, $C = C_{MIC(0)}(B)$. This of course requires that $B$ admit a stable 
MIC(0) factorization, which will be shown in Section 4. 

Fig. 1. (a) Node numbering of the super-element $Q$; (b) Connectivity pattern of $S_Q$; 
(c) Connectivity pattern of the isotropic approximation of $S_Q$; (d) Connectivity pattern of $B_Q$. 

4 Locally Optimized Approximation 
for the Model Anisotropic Problem 

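The model anisotropic problem analyzed in this section is set on a uniform
square mesh $\omega_h$. The coefficient matrix $a = \mathrm{diag}(a_{11}, a_{22})$ is
macroelement-wise constant, and the anisotropy ratio is fixed, i.e. $a_{11}/a_{22} = \varepsilon < 1$.
Then, the macroelement stiffness matrix $A_Q$ (shown up to a positive scalar factor)
and the related Schur complement $S_Q$ are as follows:

$$A_Q \propto \begin{pmatrix}
2(1+\varepsilon) & -\varepsilon & -1 & -1 & -\varepsilon \\
-\varepsilon & \varepsilon & 0 & 0 & 0 \\
-1 & 0 & 1 & 0 & 0 \\
-1 & 0 & 0 & 1 & 0 \\
-\varepsilon & 0 & 0 & 0 & \varepsilon
\end{pmatrix}, \tag{7}$$

$$S_Q = \frac{2a_{22}}{1+\varepsilon} \begin{pmatrix}
2\varepsilon+\varepsilon^2 & -\varepsilon & -\varepsilon & -\varepsilon^2 \\
-\varepsilon & 2\varepsilon+1 & -1 & -\varepsilon \\
-\varepsilon & -1 & 2\varepsilon+1 & -\varepsilon \\
-\varepsilon^2 & -\varepsilon & -\varepsilon & 2\varepsilon+\varepsilon^2
\end{pmatrix}. \tag{8}$$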
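The static condensation step that leads from (7) to (8) can be checked numerically. The following is a minimal Python sketch, assuming NumPy; the function names are illustrative and not part of the paper, the centre node is numbered first, and all scalar factors in front of the matrices are omitted because they do not affect the sparsity or the entry pattern.

```python
import numpy as np

def macro_stiffness(eps):
    """Macroelement matrix with the pattern of (7), scalar factor omitted.

    Node 0 is the centre of the super-element; nodes 1..4 are coupled to it
    with weights (eps, 1, 1, eps), an ordering assumed here for illustration.
    """
    c = np.array([eps, 1.0, 1.0, eps])
    A = np.zeros((5, 5))
    A[0, 0] = c.sum()              # equals 2 (1 + eps)
    A[0, 1:] = -c
    A[1:, 0] = -c
    A[1:, 1:] = np.diag(c)
    return A

def condense(A):
    """Eliminate the centre node: S = A22 - A21 A11^{-1} A12."""
    return A[1:, 1:] - np.outer(A[1:, 0], A[0, 1:]) / A[0, 0]

def schur_pattern(eps):
    """Matrix pattern of (8), scaled to match condense(macro_stiffness(eps))."""
    M = np.array([
        [2 * eps + eps**2, -eps, -eps, -eps**2],
        [-eps, 2 * eps + 1, -1.0, -eps],
        [-eps, -1.0, 2 * eps + 1, -eps],
        [-eps**2, -eps, -eps, 2 * eps + eps**2],
    ])
    return M / (2 * (1 + eps))

eps = 0.1
S = condense(macro_stiffness(eps))
print(np.allclose(S, schur_pattern(eps)))   # condensed matrix has the pattern of (8)
print(np.allclose(S @ np.ones(4), 0.0))     # constant vectors lie in the kernel
```

Because the centre nodes of different super-elements are not coupled to each other, the same elimination can be done independently on every macroelement, which is what makes the condensation step local.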



Taking into account the symmetry arguments, we look for a locally optimized
matrix $B_Q$ (see (5)) in the form:

$$B_Q = \frac{2a_{22}}{1+\varepsilon} \begin{pmatrix}
2 & -1 & -1 & 0 \\
-1 & 2+x & -x & -1 \\
-1 & -x & 2+x & -1 \\
0 & -1 & -1 & 2
\end{pmatrix}. \tag{9}$$



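Consider now the local eigenvalue problem: find $\lambda \in \mathbb{R}$, $0 \ne w \in \mathbb{R}^4$ such that

$$S_Q w = \lambda B_Q w. \tag{10}$$

The optimal value of the parameter $x = x(Q) > 0$ will minimize the local spectral
condition number

$$\kappa_Q = \kappa(B_Q^{-1} S_Q) = \frac{\lambda_{\max}}{\lambda_{\min}}. \tag{11}$$

Obviously $\mathrm{Ker}(S_Q) = \mathrm{Ker}(B_Q) = \mathrm{Span}\{e\}$ where $e = (1, 1, 1, 1)^T$. Thus, (10)
is reduced to a $3 \times 3$ eigenvalue problem and the related eigenvalues are:

$$\lambda_1 = \frac{1+\varepsilon}{1+x}, \qquad \lambda_2 = \varepsilon, \qquad \lambda_3 = \varepsilon(1+\varepsilon).$$

Three cases are considered depending on the value of $x$, namely:

$$\begin{array}{lll}
x \in \left(0, \frac{1}{\varepsilon} - 1\right): & \varepsilon < \varepsilon(1+\varepsilon) < \dfrac{1+\varepsilon}{1+x}, & \kappa_Q = \dfrac{1+\varepsilon}{\varepsilon(1+x)}, \\[2ex]
x \in \left[\frac{1}{\varepsilon} - 1, \frac{1}{\varepsilon}\right]: & \varepsilon \le \dfrac{1+\varepsilon}{1+x} \le \varepsilon(1+\varepsilon), & \kappa_Q = 1+\varepsilon, \\[2ex]
x > \frac{1}{\varepsilon}: & \dfrac{1+\varepsilon}{1+x} < \varepsilon < \varepsilon(1+\varepsilon), & \kappa_Q = \varepsilon(1+x).
\end{array}$$

The global condition number estimate follows directly from the local one. The result
of the presented analysis is summarized in the next theorem.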

Theorem 1. Consider the non-conforming FEM problem (2) on a square mesh.
Let $B$ be defined by (6), where $B_Q$ are determined by (9) and the parameter $x$
satisfies the local optimality condition

$$x \in \left[\frac{1}{\varepsilon} - 1, \frac{1}{\varepsilon}\right]. \tag{12}$$

Then:

(i) the sparse approximation $B$ of the Schur complement $S$ satisfies the conditions
for a stable MIC(0) factorization;

(ii) the matrices $B$ and $S$ are spectrally equivalent, where the following estimate
for the relative condition number holds uniformly with respect to any possible
coefficient jumps

$$\kappa(B^{-1}S) \le 1 + \varepsilon < 2. \tag{13}$$
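The local analysis behind Theorem 1 is easy to reproduce numerically. The sketch below is an illustration in Python with NumPy, not the authors' code; the function names are hypothetical, and the common scalar factor of (8) and (9) is dropped since it cancels in the generalized eigenvalue problem (10). The pencil is restricted to the subspace orthogonal to the common kernel vector $(1,1,1,1)^T$.

```python
import numpy as np

def s_local(eps):
    """Entry pattern of the local Schur complement (8), scalar factor dropped."""
    return np.array([
        [2 * eps + eps**2, -eps, -eps, -eps**2],
        [-eps, 2 * eps + 1, -1.0, -eps],
        [-eps, -1.0, 2 * eps + 1, -eps],
        [-eps**2, -eps, -eps, 2 * eps + eps**2],
    ])

def b_local(x):
    """Entry pattern of the local approximation (9), scalar factor dropped."""
    return np.array([
        [2.0, -1.0, -1.0, 0.0],
        [-1.0, 2.0 + x, -x, -1.0],
        [-1.0, -x, 2.0 + x, -1.0],
        [0.0, -1.0, -1.0, 2.0],
    ])

def local_condition_number(eps, x):
    """Ratio of extreme eigenvalues of S w = lambda B w off the common kernel."""
    # Orthonormal basis of the complement of span{(1, 1, 1, 1)}.
    Z = np.array([[0, 1, -1, 0], [1, 0, 0, -1], [1, -1, -1, 1]], dtype=float).T
    Z /= np.linalg.norm(Z, axis=0)
    S_r = Z.T @ s_local(eps) @ Z
    B_r = Z.T @ b_local(x) @ Z
    lam = np.linalg.eigvals(np.linalg.solve(B_r, S_r)).real
    return lam.max() / lam.min()

eps = 0.1
for x in (0.0, 5.0, 9.0, 9.5, 10.0, 20.0):
    print(x, local_condition_number(eps, x))
```

For $\varepsilon = 0.1$ the printed values decrease until $x$ reaches $1/\varepsilon - 1 = 9$, stay at $1 + \varepsilon$ up to $x = 1/\varepsilon = 10$, and grow afterwards, which is the behaviour described by the three cases.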

5 Parallel Preconditioning Algorithm 

The first step, i.e. the static condensation, is local and therefore can be performed 
fully in parallel. This is the reason to focus our attention on the PCG solution 
of the reduced system with the Schur complement S. Let us recall that the 
preconditioner was introduced as C = Cmic(o){B) = L^DL, where D is the 
diagonal and L is the strictly lower triangular part of B. The computational 
complexity of one PCG iteration is given by 

MpcG-'^^N. (14) 

One can find some more details in [12] where a similar analysis for the isotropic
case is made.

MIC(0) is an inherently sequential algorithm where the solution of the arising
triangular systems is typically recursive. We overcome this disadvantage by the
proposed construction of the matrix $B$. For simplicity of the presentation we
consider the model problem in $\Omega = (0, 1)^2$ on a square mesh with a mesh size
$h = 1/n$ (subsequently each cell is split into two triangles to get $\mathcal{T}_h$). The structure of
the matrices $S$ and $B$ is illustrated in Figure 2 where a column-wise numbering
of the unknowns is used. A serious advantage of the introduced matrix $B$ is that
all of its diagonal blocks are diagonal. In this case, the implementation of the
PCG solution step $C^{-1}v$ is fully parallel within each of these blocks.





Fig. 2. Sparsity pattern of the matrices $S$ and $B$, $\Omega = (0, 1)^2$.



A simple model for the arithmetic and the communication times is applied
(see, e.g., [13]) in the analysis of the parallel algorithm. We assume that: a) the
execution of $M$ arithmetic operations on one processor takes time $T_a = M t_a$,
where $t_a$ is the average unit time to perform one arithmetic operation on one
processor (no vectorization), and b) the time to transfer $M$ data elements from
one processor to another can be approximated by $T_{com} = \ell(t_s + M t_c)$, where
$t_s$ is the start-up time, $t_c$ is the incremental time necessary for each of the $M$
elements to be sent, and $\ell$ is the graph distance between the processors.

Further, we consider a distributed memory parallel computer where the number
of processors is $p$, and $n = mp$ for some natural number $m$. The domain $\Omega$
is split into $p$ equally sized strips where the processor $P_k$ is responsible for data
and computations related to the $k$-th strip.

Then, we get the following expressions for the communication times related
to $C^{-1}v$ and $Sv$:

$$T_{com}(C^{-1}v) = 8n(t_s + t_c), \qquad T_{com}(Sv) = 4t_s + 2(3n + 1)t_c.$$

The above communications are completely local and do not depend on the number
of processors $p > 2$, assuming that $P_k$ and $P_{k+1}$ are neighbors (see Figure 3).
The components which $P_i$ sends to $P_{i+1}$ are marked with ○, and those sent
to $P_{i-1}$ with □. The sign × indicates the place where the transferred components are
stored. Although it is not shown in the figure, the processor $P_i$ also has to
receive the corresponding components from processors $P_{i-1}$ and $P_{i+1}$.
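To get a feeling for the two communication terms, they can be evaluated for a few mesh sizes. The following Python sketch simply codes the model $T_{com} = \ell(t_s + M t_c)$ and the two expressions quoted above; the function names and the machine parameters are hypothetical and serve only as an illustration.

```python
def transfer_time(m, t_s, t_c, distance=1):
    """Model time to send m items over a given graph distance: l (t_s + m t_c)."""
    return distance * (t_s + m * t_c)

def comm_times_per_iteration(n, t_s, t_c):
    """Communication times per PCG iteration for C^{-1} v and S v."""
    t_solve = 8 * n * (t_s + t_c)                  # 8n short neighbour messages
    t_matvec = 4 * t_s + 2 * (3 * n + 1) * t_c     # 4 long neighbour messages
    return t_solve, t_matvec

# Hypothetical start-up and per-item times in seconds (assumed values).
t_s, t_c = 1.0e-5, 1.0e-8
for n in (64, 256, 1024):
    t_solve, t_matvec = comm_times_per_iteration(n, t_s, t_c)
    print(n, t_solve, t_matvec)
```

In this model the preconditioner solve pays the start-up time $8n$ times per iteration, whereas the matrix-vector product pays it only four times, so for realistic values of $t_s$ the solve dominates the communication cost.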

This setting leads to the following expression for the parallel time per one 
PCG iteration 

Tp = T;‘ « 72 ^(^+ + 14nP. (15) 




Fig. 3. Communication scheme for matrix-vector multiplication ($Sv$) and for solution
of systems with lower triangular ($L^{-1}v$) and upper triangular ($L^{-T}v$) matrices.



Remark 1. From (15) we conclude that the parallel algorithm is asymptotically
optimal. A more realistic analysis of the parallel performance needs more
information about the behavior of the introduced average timing parameters $t_a$, $t_s$
and $t_c$. The key point here is that a good parallel scalability could be expected
if $n \gg p\, t_s / t_a$.



6 Numerical Tests 



The numerical tests presented below illustrate the PCG convergence rate when
the size of the discrete problem and the anisotropy ratio are varied in the test
problem $-\varepsilon u_{xx} - u_{yy} = f$. A relative stopping criterion

$$\frac{(C^{-1} r^{N_{it}}, r^{N_{it}})}{(C^{-1} r^{0}, r^{0})} < \epsilon$$

is used in the PCG algorithm, where $r^i$ stands for the residual at the $i$-th iteration step,
$(\cdot, \cdot)$ is the standard Euclidean inner product, and $\epsilon = 10^{-6}$. The computational
domain is the unit square $\Omega = (0, 1)^2$ where homogeneous Dirichlet boundary
conditions are assumed at the bottom side. A uniform mesh is used, where
$h = 1/n$, and the size of the discrete problem is $N = 2n(n + 1)$.
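A stopping rule of this kind fits naturally into the PCG loop, because the quantity $(C^{-1}r, r)$ is computed anyway in every iteration. The following Python sketch (NumPy assumed) is a generic textbook PCG with that criterion; it is not the code used for the reported tests, and the small 1D test matrix with a Jacobi preconditioner is only a stand-in for $S$ and $C$.

```python
import numpy as np

def pcg(matvec, apply_prec, b, tol=1e-6, max_it=1000):
    """Preconditioned CG, stopped when (C^{-1} r, r) / (C^{-1} r0, r0) < tol.

    matvec(v) returns the product with the system matrix, apply_prec(r)
    returns C^{-1} r. A nonzero right-hand side is assumed.
    """
    x = np.zeros_like(b, dtype=float)
    r = np.array(b, dtype=float)
    z = apply_prec(r)
    p = z.copy()
    rz = rz0 = r @ z
    for it in range(1, max_it + 1):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = apply_prec(r)
        rz_new = r @ z
        if rz_new / rz0 < tol:          # relative criterion in the C^{-1} inner product
            return x, it
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_it

# Stand-in problem: 1D Laplacian with a diagonal (Jacobi) preconditioner.
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x, iterations = pcg(lambda v: A @ v, lambda r: r / 2.0, b)
print(iterations, np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```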
Table 1. PCG iterations: MIC(0) factorization of B.

                        ε
    n        N      1      0.1    0.001   0.001    1000
    7      112     20      22      20      19      105
   15      480     35      43      35      34      331
   31     1984     58      76      55      53      886
   63     8064     91     137      76      80     1550
  127    32512    152     260     107     112     2670

The qualitative analysis of the results given in Table 1 confirms that the
number of iterations in all cases is $O(\sqrt{n})$, namely, it grows proportionally
to $\sqrt{n}$ when $n$ is large enough. The influence of the anisotropy is in
full agreement with the theoretical estimate $\kappa(B^{-1}S) \le 1 + \varepsilon$. The last column is
included in the table to show how important it is to have a proper node numbering
following the direction of dominating anisotropy.

7 Concluding Remarks 

In this paper we have proposed an anisotropic modification of the recently
introduced two-level preconditioner for isotropic Crouzeix-Raviart FEM elliptic
systems. The new construction allows efficient PCG solution of strongly anisotropic
problems with fixed direction of dominating anisotropy. The new sparsity
structure of the local matrices $B_Q$ does not change the structure of the diagonal
blocks of the global matrix $B$. The latter preserves the potential for efficient
parallel implementation.

One of the contributions of this paper is the locally optimized construction
of $B_Q$, where the minimal value of $\kappa_Q$ is realized in the interval given by
Theorem 1. Our further plans include a generalization to 3-D problems on tetrahedral
meshes. This is a much more complicated problem where symbolic methods could
be needed to optimize the parametrized presentation of $B_Q$. In this context we
would refer to the results from [11] where an element preconditioning technique
is reported for conforming linear FEM in 3-D.

The structure of the Schur complement matrix $S$ is the same as the element
matrix for the rotated bilinear elements introduced by Rannacher and Turek.
As a result, the parallel properties of the related algorithms are similar. Some
representative results in this context can be found in [4] where the parallel times
are analyzed in more detail and a set of numerical tests on distributed
memory machines is presented.
Abstract. The paper is concerned with the FE solution of the linear
elastic BV problems on composite grids. Special interest is given to the
implementation and numerical tests of both sequential and parallel codes.



1 Introduction 

This paper deals with the numerical solution of linear elasticity boundary value
problems, which can be written in a weak form as follows:

$$\text{find } u \in V: \quad a(u, v) = b(v) \quad \forall v \in V. \tag{1}$$

Here $V \subset W = [H^1(\Omega)]^3$ is a Hilbert space of vector functions $v = (v_1, v_2, v_3)$
(see e.g. [4]). For the numerical solution of the problem (1) we use the finite
element (FE) method on composite grids, which arise as a composition of a (regular,
structured) global coarse grid and (regular) local fine grid(s).

The composite grid FE method starts with a global (coarse, regular) division
$\mathcal{T}_H(\Omega)$ of the domain $\Omega$. For simplicity, we will consider a division of the domain
$\Omega \subset \mathbb{R}^3$ into tetrahedrons. The division $\mathcal{T}_H(\Omega)$ then enables us to define the
standard FE space

$$V_H(\Omega) = \{ v \in [C(\bar{\Omega})]^3 \cap V : \ v_i|_e \in P_1 \ \ \forall e \in \mathcal{T}_H(\Omega) \}, \tag{2}$$

where $P_1$ is in our case the set of linear polynomials.

Further, let $\Omega_R \subset \Omega$ be a part of $\Omega$ where the previous discretization should
be refined, and let $\mathcal{T}_h = \mathcal{T}_h(\Omega_R)$ be a finer discretization in $\Omega_R$, which enables
us to define another FE space

$$V_h(\Omega_R) = \{ v \in [C(\bar{\Omega})]^3 \cap V : \ v_i = 0 \text{ in } \Omega \setminus \Omega_R, \ v_i|_e \in P_1 \ \ \forall e \in \mathcal{T}_h(\Omega_R) \}. \tag{3}$$

Now, the composite grid FE space can be defined as

$$\mathcal{V} = V_0 + V_1, \qquad V_0 = V_H(\Omega), \qquad V_1 = V_h(\Omega_R). \tag{4}$$

The composite grid FE space $\mathcal{V} \subset V$ can be then used for finding an approximate
composite grid solution

$$u \in \mathcal{V}: \quad a(u, v) = b(v) \quad \forall v \in \mathcal{V}. \tag{5}$$

Note that, in this construction, $\Omega_R$ can consist of several separate parts. Moreover,
the finer grid in $\Omega_R$ can be further refined and so on, which gives the multilevel
construction resulting in the composite grid space $\mathcal{V} = V_0 + V_1 + \cdots$.

The paper completes the results published in [1,2] with numerical tests performed
for the 3D case. Tests are performed using both sequential and parallel codes.



2 The Iterative Solution of Composite Grid Problems 



The decomposition (4) of the composite grid FE space can be used to construct
iterative methods for solving the composite grid problem (5). Such a method was
first introduced in [3]. The algorithm of this method, called FAC, is as follows:

FAC method

given $u^0$
for $i = 0, 1, \ldots$ until convergence do
    compute $v_0 \in V_0$ : $a(u^i + v_0, v) = b(v) \quad \forall v \in V_0$
    $u^{i+1/2} = u^i + v_0$
    compute $v_1 \in V_1$ : $a(u^{i+1/2} + v_1, v) = b(v) \quad \forall v \in V_1$
    $u^{i+1} = u^{i+1/2} + v_1$
end

Further, we will also use a symmetric version of FAC, which is obtained by
adding another correction from the subspace $V_1$ to the beginning of each FAC
iteration. This variant will be referred to as SFAC. The additive variant of FAC,
called JFAC (as the Jacobi-type method with damping), takes the form:

JFAC method

given $u^0$
for $i = 0, 1, \ldots$ until convergence do
    compute $v_0 \in V_0$ : $a(u^i + v_0, v) = b(v) \quad \forall v \in V_0$
    compute $v_1 \in V_1$ : $a(u^i + v_1, v) = b(v) \quad \forall v \in V_1$
    $u^{i+1} = u^i + \tfrac{1}{2} v_0 + \tfrac{1}{2} v_1$
end



These methods can be rewritten into an operator form which is more convenient
for computer implementation. The composite grid FE problem can be
then rewritten as the equation

$$Au = b \tag{6}$$

with a symmetric positive definite operator $A$. The iterative methods FAC, JFAC
and SFAC can be written as preconditioned Richardson's iterations

$$u^{k+1} = u^k + G(b - Au^k) \tag{7}$$

with

$$G = G_{FAC} = B_0 + B_1 - B_1 A B_0, \tag{8}$$

$$G = G_{JFAC} = \tfrac{1}{2} B_0 + \tfrac{1}{2} B_1, \tag{9}$$

$$G = G_{SFAC} = B_1 + B_0 - B_1 A B_0 - B_0 A B_1 + B_1 A B_0 A B_1, \tag{10}$$

where $B_i = I_i A_i^{-1} R_i$, and $R_i$ and $I_i$ ($R_i = I_i^*$) are the restriction and inclusion
operators, respectively ($i = 0, 1$).

Since $A$ is symmetric positive definite, the system (6) can be solved by the conjugate
gradient method with preconditioning matrices $G_{FAC}$, $G_{JFAC}$ or $G_{SFAC}$.
This preconditioning is implemented as the approximate solution of the system
$Ag = r$ using one FAC (JFAC, SFAC) iteration starting from the zero initial
guess.
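The relation between the subspace corrections and the operators (8)-(10) can be illustrated on a small algebraic example. The Python sketch below (NumPy assumed) is not the authors' elasticity code: a random symmetric positive definite matrix stands in for the composite grid operator $A$, the inclusion matrices `I0` and `I1` are hypothetical, and the subproblems are solved exactly with $A_i = R_i A I_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n0, n1 = 10, 6, 6
X = rng.standard_normal((n, n))
A = X @ X.T + n * np.eye(n)             # SPD stand-in for the composite operator
I0 = rng.standard_normal((n, n0))       # hypothetical inclusion of the coarse space V_0
I1 = np.eye(n)[:, :n1]                  # hypothetical inclusion of the local fine space V_1

def subspace_solver(I):
    """B_i = I_i A_i^{-1} R_i with R_i = I_i^T and A_i = R_i A I_i."""
    R = I.T
    return I @ np.linalg.solve(R @ A @ I, R)

B0, B1 = subspace_solver(I0), subspace_solver(I1)
G_fac = B0 + B1 - B1 @ A @ B0
G_jfac = 0.5 * B0 + 0.5 * B1
G_sfac = B1 + B0 - B1 @ A @ B0 - B0 @ A @ B1 + B1 @ A @ B0 @ A @ B1

def fac_step(u, b):
    """One FAC iteration: coarse correction followed by fine correction."""
    u = u + B0 @ (b - A @ u)
    return u + B1 @ (b - A @ u)

def sfac_step(u, b):
    """One SFAC iteration: fine, coarse and again fine correction."""
    u = u + B1 @ (b - A @ u)
    u = u + B0 @ (b - A @ u)
    return u + B1 @ (b - A @ u)

r = rng.standard_normal(n)
zero = np.zeros(n)
print(np.allclose(fac_step(zero, r), G_fac @ r))     # one step from zero applies G_FAC
print(np.allclose(sfac_step(zero, r), G_sfac @ r))   # one step from zero applies G_SFAC
print(np.allclose(G_sfac, G_sfac.T))                 # the symmetric variant is symmetric
rho = max(abs(np.linalg.eigvals(np.eye(n) - G_fac @ A)))
print(rho)                                           # spectral radius of the FAC iteration
```

The sketch also shows why $G_{FAC}$ is called nonsymmetric: the term $B_1 A B_0$ has no transposed counterpart, whereas in $G_{SFAC}$ the two mixed terms are transposes of each other.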



3 Implementation 

In this section, we focus on the implementation of the two-level composite grid
FE method arising from the spaces $V_H$ and $V_h$. We shall start by introducing
the standard FE bases $\{\phi_i^H\}$ and $\{\phi_i^h\}$ in $V_H$ and $V_h$, respectively. Then we can
introduce the subproblem matrices and right-hand side vectors as

$$A_0 = A_H = [a(\phi_i^H, \phi_j^H)], \qquad b_0 = b_H = [b(\phi_i^H)], \qquad i, j = 1, \ldots, n_H, \tag{11}$$

$$A_1 = A_h = [a(\phi_i^h, \phi_j^h)], \qquad b_1 = b_h = [b(\phi_i^h)], \qquad i, j = 1, \ldots, n_h. \tag{12}$$

For the implementation of the composite grid FE method, we must introduce
some parameterization of the functions from the composite grid FE space $\mathcal{V}$. In
our case we use the nodal basis implementation for fully compatible global and
local grids, where $u \in \mathcal{V}$ is represented in the form

$$u = \sum_i u_i^H \phi_i^H + \sum_i u_i^h \phi_i^h + \sum_i u_i^{Hh} \phi_i^{Hh}. \tag{13}$$

Here the first sum is taken over those coarse grid basis functions $\phi_i^H$ which have
their support in $\Omega \setminus \Omega_R$. The second sum is over all basis functions from $V_h$. The
third sum is over the new basis functions $\phi_i^{Hh}$ corresponding to the coarse grid nodes
lying on the interface $\partial\Omega_R \setminus \partial\Omega$. The basis functions $\phi_i^{Hh} \in \mathcal{V}$ have zero values
in all coarse and fine grid nodes with the exception of the value 1 in one coarse grid
node on $\partial\Omega_R \setminus \partial\Omega$ and the values given by the interpolation in the corresponding
fine grid nodes on $\partial\Omega_R \setminus \partial\Omega$. This interpolation guarantees the continuity of the
basis function $\phi_i^{Hh}$. See also Fig. 1.

Note that the nodal basis implementation in principle requires compatibility
of the global and local grids only on the interface $\partial\Omega_R \setminus \partial\Omega$. In the case of not
fully compatible global and local grids, we seek the composite grid solution in
the space $\tilde{\mathcal{V}} \ne \mathcal{V}$, where $\tilde{\mathcal{V}}$ is the space of the functions that have the form
(13). Functions from $V_H$ having a support in $\Omega_R$ are used only for solving the
subproblem on the coarse grid, which is a part of the preconditioning matrix.

Fig. 1. Nodal basis of the composite grid FE space

For the implementation of the conjugate gradient method on composite grids
and for the implementation of the FAC and JFAC iterative methods as preconditioners,
we need procedures for

- a prolongation of the displacement $u = I_i u_i$, $i = 0, 1$, which is realized by a
  linear interpolation,

- restrictions of the residual $r_i = R_i r$, $i = 0, 1$, given by the transformation
  of nodal forces from fine grid nodes to coarse grid ones ($R_i = I_i^* = I_i^T$),

- a computation of the residual $r = b - Au$, implemented using element-by-element
  multiplications for the nodes lying on the interface $\partial\Omega_R \setminus \partial\Omega$
  and subproblem matrix and vector multiplications for the nodes outside the
  interface,

- a solution of the subproblems $A_i u_i = r_i$, $i = 0, 1$, obtained by the conjugate
  gradient method with a preconditioning given by the displacement
  decomposition (DiD) or the domain decomposition (DD).

Remark:

In some real-life problems the coarse grid cannot resolve the fine distribution of
materials. It means that

$$A_0 \ne R_0 A I_0. \tag{14}$$

It was proved ([1]) that even in this case the FAC, SFAC and JFAC methods
converge if we use a suitable damping. The numerical tests show (see next
section) that if we use these methods as preconditioning methods, the CG on
composite grids may not converge.
4 The Numerical Tests

Numerical tests were performed on two benchmark problems referred to as the
“MOST” problem and the “Dolni Rozinka” problem.

I: The “MOST” Problem - A Bridge Pile under Loading

This problem, which is illustrated in Fig. 2, is formulated in a domain $\Omega$ of
size 37.2 × 37.2 × 31 metres. This domain contains a concrete pile of size 1.2 ×
1.2 × 15 metres with elastic modulus $E = 31.5$ GPa, Poisson's ratio $\nu = 0.2$
and unit weight $\gamma = 2.5$ g/cm³. The pile is surrounded by an elastic clay with
elastic modulus $E = 19.88$ MPa, Poisson's ratio $\nu = 0.42$ and unit weight
$\gamma = 1.85$ g/cm³. The pile is loaded by a pressure of 1.5 MPa. On the other sides
of $\Omega$, we assume zero normal displacements and zero tangential stresses. The
global coarse grid has 32 × 32 × 32 nodes (98304 unknowns); the local fine grid
uses halved grid sizes and has 23 × 23 × 41 nodes (65067 unknowns).




Fig. 2. “MOST” problem



II: The “Dolni Rozinka” Problem - A Mining at the Uranium Ore Deposit

This problem, which is illustrated in Fig. 3, is formulated in a domain $\Omega$ of size
1430 × 550 × 600 metres. The two basic materials in the problem domain are
the uranium ore and the surrounding rock. These materials are assumed to be
elastic with the same material constants: elasticity modulus $E = 2000$ MPa,
Poisson ratio $\nu = 0.25$ and the unit mass $\rho = 2.7$ g cm⁻³. The extracted volumes
and their vicinity are filled by a goaf material. For the goaf material, we again
assume a linear elastic behaviour with elastic constants $E = 2500$ MPa, $\nu = 0.4$ and
$\rho = 2.0$ g cm⁻³. We use pure Dirichlet boundary conditions given by the solution
of a pure Neumann problem modelling the pre-mining stress state. The global
coarse grid has 124 × 137 × 76 nodes (3873264 unknowns); the local fine grid uses
quarter grid sizes and has 49 × 113 × 65 nodes (1079715 unknowns).



Fig. 3. “Dolni Rozinka” problem



The numerical tests were done using both sequential and parallel codes. The 
sequential solution was performed on a workstation IBM RS/6000 43P-260 (the 
preconditioning for subproblems given by DiD), the parallel solution on a cluster 
of 9 PC’s with AMD Athlon 1.4 GHz 266MHz FSB processors (the precondi- 
tioning for subproblems given by DD). 

On the benchmark problems the following tests were performed:

a) Tests on the preconditioners

The tests were performed with five FAC-like preconditioners:

1. the nonsymmetric FAC starting from a fine grid (F → C)
2. the symmetric FAC starting from a fine grid (F → C → F)
3. the nonsymmetric FAC starting from a coarse grid (C → F)
4. the symmetric FAC starting from a coarse grid (C → F → C)
5. the additive FAC (JFAC)

To test the parallel code, tests were done only on Benchmark II. The accuracy
for CG on the composite grid was ε = 1.0e-10 (see Table 1). The tests show
that we can successfully use the nonsymmetric preconditioners and that
orthogonalization of the search directions is not necessary if we solve the subproblems
more accurately (ε = 0.01). The results for the JFAC preconditioner show that
the “vertical” parallelization (the subproblems are solved in parallel) does not
give any significant improvement of the CPU time.
Table 1. The comparison of the preconditioners

                            precond.1  precond.2  precond.3  precond.4  precond.5
ε = 0.1,  ort.0   CG-it.        14         13         14         11         21
                  CPU         1876s      2236s      1819s      2469s      2417s
ε = 0.1,  ort.1   CG-it.        11         11         12          9         19
                  CPU         1525s      1719s      1520s      2030s      2162s
ε = 0.1,  ort.3   CG-it.        11         11         11          9         19
                  CPU         1628s      1842s      1549s      2153s      2334s
ε = 0.01, ort.0   CG-it.         8          8          8          7         15
                  CPU         1484s      1678s      1402s      2077s      2278s
ε = 0.01, ort.1   CG-it.         8          8          8          7         15
                  CPU         1476s      1657s      1374s      2039s      2242s
ε = 0.01, ort.3   CG-it.         8          8          8          7         15
                  CPU         1539s      1764s      1482s      2151s      2408s


b) Influence of the material inconsistency on the efficiency of the preconditioners.

In some problems the coarse grid is not able to model small nonhomogeneous
details and the subproblems are materially inconsistent. If we use the FAC-like
techniques for solving such problems, the algorithm converges for suitable damping
(see [1]). But if we use these techniques as preconditioners, this material
inconsistency can make the preconditioners inefficient. We tested the
influence of material inconsistency on Benchmark I for three cases which differ
by the value of the Young modulus for the pillar in the coarse and fine grids:

A0: E = 31500 MPa for the coarse grid, E = 31500 MPa for the fine grid (materially consistent)
A1: E = 1000 MPa for the coarse grid, E = 31500 MPa for the fine grid
A2: E = 1 MPa for the coarse grid, E = 31500 MPa for the fine grid.

The results are shown in Table 2. The tests confirmed that strong material
inconsistency can degrade the convergence of the CG method with such a
“preconditioner”. The reason is that, in the case of material inconsistency, the
solution on a coarse grid gives incorrect Dirichlet boundary conditions for the solution
on a fine grid, and the preconditioning provides a pseudoresidual which does not
correspond to the composite problem error. Tests show that in the case of strong
inconsistency we must use preconditioner 2, the symmetric FAC method
starting on a fine grid, and we must solve the subproblems more accurately.

Table 2. Influence of material inconsistency on preconditioned CG convergence

                     mat. A0               mat. A1                 mat. A2
prec.1, ε = 0.1      18 it., CPU = 470s    142 it., CPU = 1187s    nonconvergent
prec.2, ε = 0.1      13 it., CPU = 357s    22 it., CPU = 282s      it > 300
prec.3, ε = 0.1      16 it., CPU = 389s    201 it., CPU = 1411s    nonconvergent
prec.4, ε = 0.1      14 it., CPU = 609s    nonconvergent           nonconvergent
prec.5, ε = 0.1      30 it., CPU = 552s    121 it., CPU = 1172s    it > 300
prec.1, ε = 0.01     12 it., CPU = 501s    23 it., CPU = 393s      nonconvergent
prec.2, ε = 0.01     11 it., CPU = 541s    14 it., CPU = 320s      19 it., CPU = 350s
prec.3, ε = 0.01     11 it., CPU = 504s    20 it., CPU = 334s      nonconvergent
prec.4, ε = 0.01     11 it., CPU = 956s    nonconvergent           nonconvergent
prec.5, ε = 0.01     25 it., CPU = 958s    89 it., CPU = 1348s     it > 300

c) CPU time for various parts of the algorithm 

We solved Benchmark II with the sequential and parallel solvers to show how
time consuming the various parts of the algorithm are when using the sequential
solver and how this can be improved using the parallel solver. The results are
shown in Table 3. Note that the parallel and sequential codes run on different
computers. We see that the most time consuming part of the sequential code is
the solution of linear systems for the subproblems. Further CPU time reduction
will be possible after introducing a parallel access to the composite matrix-vector
multiplication.



Table 3. CPU times for various parts of the algorithm

                        CPU-seq.   CPU-1 it.   CPU-par   CPU-1 it.
A × u                   111.0 s                78.0 s
A⁻¹ × b, coarse         752.0 s    32.7 s      32.2 s    1.5 s
A⁻¹ × b, fine            72.0 s    12.0 s       2.4 s    0.6 s
Abstract. The simulation of sedimentary basins aims at reconstructing
their historical evolution in order to provide quantitative predictions about
phenomena leading to hydrocarbon accumulations. The kernel of this
simulation is the numerical solution of a complex system of non-linear
partial differential equations (PDEs) of mixed parabolic-hyperbolic type
in 3D. A discretisation and linearisation of this system leads to very large,
ill-conditioned, non-symmetric systems of linear equations with three
unknowns per mesh cell, i.e. pressure, geostatic load, and oil saturation.

This article describes the parallel version of a preconditioner for these
systems, presented in its sequential form in [7]. It consists of three steps:
in the first step a local decoupling of the pressure and saturation unknowns
aims at concentrating in the “pressure block” the elliptic part of
the system which is then, in the second step, preconditioned by AMG.
The third step finally consists in recoupling the equations. Each step
is efficiently parallelised using a partitioning of the domain into vertical
layers along the y-axis and a distributed memory model within the
PETSc library (Argonne National Laboratory, IL). The main new ingredient
in the parallel version is a parallel AMG preconditioner for the
pressure block, for which we use the BoomerAMG implementation in the
hypre library [4].

Numerical results on real case studies exhibit (i) a significant reduction
of CPU times, up to a factor 5 with respect to a block Jacobi preconditioner
with an ILU(0) factorisation of each block, (ii) robustness with
respect to heterogeneities, anisotropies and high migration ratios, and
(iii) a speedup of up to 4 on 8 processors.

1 Introduction 

In recent years the 3D modelling and simulation of sedimentary basins has be- 
come an integral part in the exploration of present and future reservoirs for 
almost all major oil companies. It aims at reconstructing the historical evolu- 
tion of a sedimentary basin in order to provide quantitative predictions about 
phenomena leading to hydrocarbon accumulations. 

A simplified kinematic model of the geometry of the basin is first computed
from the geology of the basin using a back-stripping algorithm. The full model
comprising hydrocarbon generation, heat transfer, compaction of the porous
media, and multi-phase Darcy flow is then calculated forward in time adding
layer after layer using the geometric history calculated during the back-stripping
process.

The most computing intensive part of the simulation consists in solving nu- 
merically a complex system of non-linear partial differential equations (PDEs) 
modelling the compaction of the porous media and multi-phase Darcy flow in 
3D. These equations are discretized using a cell-centred finite volume method 
in space and an implicit Euler scheme in time, and linearised using Newton’s 
method. In the case of two-phase flow (water - oil), we will therefore have to 
solve at each time step and in each Newton iteration, a system of coupled linear 
equations with three unknowns per element (pressure, geostatic load, oil satu- 
ration), very large, non-symmetric, and ill-conditioned, in particular due to the 
strong heterogeneities and anisotropies of the porous media. 

The idea for our preconditioning strategy is based on the methods developed 
in [2, 3, 5, 9] for the related problem of oil reservoir simulation and is described 
in detail in [7]. It consists of three steps. First of all the equations for pressure 
and saturation are locally decoupled by taking linear combinations of the two 
equations on each element. This decoupling aims not only at reducing the cou- 
pling, but also at concentrating in the pressure block the elliptic part of the 
system which requires a good global preconditioning. The second step consists 
then in preconditioning the pressure block by an algebraic multi-grid method. 
The recoupling of the pressure with the other unknowns is then in a final step 
achieved by applying a block Gauss-Seidel strategy or the combinative technique 
defined in [2] for the reservoir case. 

Each step of the preconditioner is efficiently parallelised using a partitioning 
of the domain into vertical layers along the y-axis and a distributed memory 
model within the PETSc library (Argonne National Laboratory, IL). Only slight 
modifications are necessary for most of the components of the sequential precon- 
ditioner in [7]. The main new ingredient which is necessary however, is a parallel 
preconditioner for the pressure block for which we use BoomerAMG [4] from the 
hypre package (Lawrence Livermore National Laboratory, CA). 

We present results with the sequential preconditioner for a new test case with 
very high migration ratios which was not available for [7]. For this test case, 
the Quasi-IMPES decoupling together with a recoupling using the combinative 
technique outperforms all other combinations suggested in [7] significantly. This 
suggests that in the case of high migration ratios a local decoupling of the two 
mass conservation equations does not reduce the global coupling of these two 
equations significantly. Moreover, Gauss decoupling and Householder decoupling 
destroy the ellipticity of the pressure block and thus AMG does not work very 
well anymore. All the performance tests (sequential and parallel) were therefore 
carried out with Quasi-IMPES decoupling. 

In a series of numerical tests, the performance of the preconditioner is compared
to block Jacobi with an incomplete LU factorisation, ILU(0), of each block,
previously the only way to precondition these linear systems efficiently in parallel.
We observe a very significant reduction of the CPU-time for the linear solver
on one processor, up to a factor 5 with respect to ILU(0). The performance of
the preconditioner shows no degradation with respect to the number of elements
or to strong heterogeneities and anisotropies in the porous media, and depends
only very mildly on the migration ratios of oil. Furthermore, the parallel solver
exhibits good speedup on an SGI Origin 2000. In all the test cases we obtain
speedups between 3 and 4.

The paper is organised as follows. In Section 2 we will give a detailed de- 
scription of the preconditioning strategy, followed by a series of numerical tests 
in Section 3. 



2 Preconditioning Strategy 



The mathematical model for the forward problem is formulated as a system
of 4 coupled non-linear PDEs and accounts for the conservation equations of
two phase Darcy flow (water and oil), as well as the conservation equation of 
vertical momentum describing the compaction of the porous media together with 
a rheology law relating the porosity to the geostatic load and to the average 
fluid pressure. We refer to [7] and [8] and to the references therein for a detailed 
description of the model. 

The equations are discretized using a finite volume scheme in space and a 
fully implicit time integration. After a Newton linearisation and an elimination 
of the rheology law we end up at each time step and in each Newton iteration 
with a system of linear equations of the form 



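\[
A\,\delta y :=
\begin{pmatrix}
A^{w,p} & A^{w,\sigma} & A^{w,s} \\
A^{\sigma,p} & A^{\sigma} & 0 \\
A^{o,p} & A^{o,\sigma} & A^{o,s}
\end{pmatrix}
\begin{pmatrix} \delta p \\ \delta\sigma \\ \delta s \end{pmatrix}
= f
\qquad (1)
\]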



where each one of the blocks in $A$ is an $N \times N$ matrix with $N$ being the
number of mesh cells. The block rows correspond to the discretisations of the
water conservation equation, the vertical compaction equation and the oil
conservation equation, respectively. The block columns correspond to the derivatives
with respect to the primary variables $p$, $\sigma$, $s$ (i.e. average pore pressure, geostatic
load and oil saturation), respectively. The details of the discretisation and the
linearisation can be found in [7].

System (1) is very large, non-symmetric and ill-conditioned, in particular 
due to the strong heterogeneities and anisotropies of the porous media. For an 
efficient and robust iterative solution of this system, it is therefore crucial to find 
good preconditioners. 

The idea for our preconditioning strategy stems from the related field of oil
reservoir simulation, and was first presented for sedimentary basin simulations
in [7]. In reservoir simulations, the time spans that are simulated
are much shorter than the time spans considered here (i.e. years
instead of millions of years), so that the porosity and the effective stress $\sigma$ can
be kept constant, thus leading to systems similar to (1), but in pressure and
saturation only. In [2,3,5] a preconditioning strategy has been developed for
these problems that decouples the treatment of pressure and saturation. In [7] we
have used a similar strategy for (1) which consists of three steps: (i) local
decoupling of the two mass conservation equations on each cell of the mesh (or
in other words, choosing an appropriate equation for the average pore pressure in
each cell), (ii) preconditioning of the pressure unknowns, (iii) preconditioning of the
remaining unknowns (Block Gauss-Seidel), or if this is not sufficient,
feedback-coupling with the pressure equation (two-stage preconditioners). We will describe
these three steps briefly and explain in particular how they are parallelised. In
order to do this we need to first describe the parallelisation strategy.



2.1 Parallelisation Strategy 

At the moment, the parallelisation uses a partitioning of the domain into verti- 
cal layers along the y-axis and a distributed memory model. However, this one- 
dimensional topology for the parallel model is inherited from the parallelisation 
of the entire sedimentary basin simulation code and the preconditioner itself is 
not restricted to a partitioning into vertical layers. Communications between the 
processors and synchronisation are handled using MPI (Message Passing Inter- 
face) within the PETSc library (Argonne National Laboratory, IL) on the basis 
of a ghost point strategy (see [1] for details).

The parallelisation of each individual component of the preconditioner pre- 
sented in [7] is discussed in the following. 



2.2 First Step: Local Decoupling - Choice of the Pressure Equation

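The decoupling step corresponds to multiplying system (1) by

\[
\mathcal{D}^{-1} :=
\begin{pmatrix}
\mathrm{Diag}(\gamma^{w,w}) & 0 & \mathrm{Diag}(\gamma^{w,o}) \\
0 & I & 0 \\
\mathrm{Diag}(\gamma^{o,w}) & 0 & \mathrm{Diag}(\gamma^{o,o})
\end{pmatrix}
\qquad (2)
\]

and leads to the transformed system

\[
\tilde{A}\,\delta y :=
\begin{pmatrix}
\tilde{A}^{p} & \tilde{A}^{p,\sigma} & \tilde{A}^{p,s} \\
A^{\sigma,p} & A^{\sigma} & 0 \\
\tilde{A}^{s,p} & \tilde{A}^{s,\sigma} & \tilde{A}^{s}
\end{pmatrix}
\begin{pmatrix} \delta p \\ \delta\sigma \\ \delta s \end{pmatrix}
=: \tilde{f}
\qquad (3)
\]

with $\tilde{A} := \mathcal{D}^{-1} A$ and $\tilde{f} := \mathcal{D}^{-1} f$. This corresponds to a left preconditioning
of (1) by $\mathcal{D}$. The matrix $\mathcal{D}$ is chosen such that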

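(a) the dependency of the pressure equation on the saturation, i.e. the block $\tilde{A}^{p,s}$, is reduced,

(b) the block $\tilde{A}^{p}$ concentrates the elliptic part of the system,

(c) the norm of $\mathcal{D}^{-1}$ is $O(1)$.
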
In [7] we proposed three different choices for $\mathcal{D}$ based on Gaussian
elimination, on Householder reflection, and on a quasi-explicit treatment of the
saturation (Quasi-IMPES), respectively. Unfortunately none of these choices satisfies
all three criteria above. The first choice satisfies (a), but does not
necessarily satisfy (b) and (c) very well; the second choice satisfies (a) and (c),
but again does not satisfy (b) very well; the third choice on the other hand
does not satisfy (a), but satisfies (b) and (c). We will analyse the importance of
these three criteria more carefully in Section 3.1.

No additional work is necessary to parallelise this step, since all the operations 
are local to each mesh cell and thus local to each processor. 

2.3 Second Step: Pressure Block Preconditioner 

From the decoupling step, the block $\tilde{A}^{p}$ should concentrate the ellipticity, and
thus the ill-conditioning, of the system and basically corresponds to a discrete
diffusion operator with a strongly heterogeneous and anisotropic diffusivity tensor.
We propose to use algebraic multi-grid (AMG) which is known to be efficient
and robust for such stiff elliptic linear systems, even in the presence of strong
heterogeneities and anisotropies. More precisely, we shall use BoomerAMG by
Henson and Yang [4], a very efficient and robust parallel implementation of AMG
within the hypre software package (Lawrence Livermore National Lab, CA).

To test whether this implementation is competitive in the sequential case, we 
will also include some results in Section 3 comparing BoomerAMG on one processor 
with the implementation AMG1R5 by Ruge and Stüben [6] which was used in [7].

2.4 Third Step: Recoupling 

The decoupling step is aimed at reducing the dependency of the pressure
equations on the saturation. Thus, provided the decoupling is effective, we can assume
that the block $\tilde{A}^{p,s}$ is small compared to $\tilde{A}^{p}$. Furthermore, physical
considerations also allow us to make the same assumption about the block $\tilde{A}^{p,\sigma}$ (see [7]
for details). This leads to the idea for the following Block Gauss-Seidel (BGS)
preconditioner for (3):

\[
\mathcal{P}_{BGS} :=
\begin{pmatrix}
\mathcal{P}^{p} & 0 & 0 \\
A^{\sigma,p} & A^{\sigma} & 0 \\
\tilde{A}^{s,p} & \tilde{A}^{s,\sigma} & \mathcal{P}^{s}
\end{pmatrix}
\qquad (4)
\]

Here, $\mathcal{P}^{p}$ is the AMG preconditioner for the pressure block $\tilde{A}^{p}$ discussed above,
and $\mathcal{P}^{s}$ is a preconditioner for the saturation block $\tilde{A}^{s}$. Since the dependency
on saturation is hyperbolic or transport dominated parabolic, it is sufficient to
use a simple relaxation scheme, like Jacobi or Gauss-Seidel, for $\mathcal{P}^{s}$. In Section 3
we will use one iteration of a hybrid version of Jacobi and Gauss-Seidel for $\mathcal{P}^{s}$
which involves no communication between the different processors and reduces
to Gauss-Seidel on each processor. Finally, since the compaction equation is one
dimensional and can be solved in each column independently, the block $A^{\sigma}$ can
be cheaply inverted by back substitutions. This is also the case in the parallel
version, provided each vertical mesh column is assigned to one processor.
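
The action of a Block Gauss-Seidel preconditioner of this kind on a residual amounts to one block forward substitution. The following is a minimal sketch only, not the authors' implementation: the function name `apply_bgs`, the dictionary keys and the small dense random blocks are hypothetical, and exact solves stand in for the AMG cycle and the relaxation sweep.

```python
import numpy as np

def apply_bgs(blocks, solve_p, solve_sigma, solve_s, r_p, r_sigma, r_s):
    """Block forward substitution with a lower block triangular preconditioner."""
    # Pressure block first (an AMG cycle in the paper, any callable here).
    z_p = solve_p(r_p)
    # Compaction block: uses the pressure update that was just computed.
    z_sigma = solve_sigma(r_sigma - blocks["sigma_p"] @ z_p)
    # Saturation block: uses both previous updates.
    z_s = solve_s(r_s - blocks["s_p"] @ z_p - blocks["s_sigma"] @ z_sigma)
    return z_p, z_sigma, z_s

# Toy data (hypothetical): small dense blocks instead of sparse finite volume matrices.
rng = np.random.default_rng(0)
n = 4
P_p = rng.standard_normal((n, n)) + n * np.eye(n)               # stand-in for the AMG operator
A_sigma = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)  # triangular, cheap to invert
P_s = np.diag(rng.uniform(1.0, 2.0, n))                         # Jacobi-like stand-in
blocks = {
    "sigma_p": rng.standard_normal((n, n)),
    "s_p": rng.standard_normal((n, n)),
    "s_sigma": rng.standard_normal((n, n)),
}
r_p, r_sigma, r_s = rng.standard_normal((3, n))

z_p, z_sigma, z_s = apply_bgs(
    blocks,
    lambda r: np.linalg.solve(P_p, r),
    lambda r: np.linalg.solve(A_sigma, r),
    lambda r: np.linalg.solve(P_s, r),
    r_p, r_sigma, r_s,
)
print(z_p, z_sigma, z_s)
```

Passing the three block solvers as callables keeps the sketch independent of how each diagonal block is actually approximated.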

The assumption that the blocks $\tilde{A}^{p,\sigma}$ and $\tilde{A}^{p,s}$ can be neglected may be too
strong. In such a case the preconditioner must also provide some
feedback from the second and the third block row to the first. A way of providing
such a feedback-coupling is to combine the Block Gauss-Seidel preconditioner $\mathcal{P}_{BGS}$
in a multiplicative way with a second preconditioner that provides (at
least locally) such a coupling and that is cheap to invert. A good candidate
for such a preconditioner is zero fill-in incomplete LU factorisation, ILU(0),
of $\tilde{A}$. It is cheap and provides a good local coupling of the different physical
unknowns. Again to reduce communication in the parallel version we carry out
the incomplete LU factorisation only locally on each processor. In other words,
the second preconditioner is chosen to be a block Jacobi preconditioner with an
ILU(0) factorisation of each block, denoted $\mathcal{P}_{BILU}$. The combined preconditioner
can then be written as

\[
\mathcal{P}_{C2S}^{-1} := \mathcal{P}_{BGS}^{-1} + \mathcal{P}_{BILU}^{-1}\,\bigl(I - \tilde{A}\,\mathcal{P}_{BGS}^{-1}\bigr). \qquad (5)
\]

Similar techniques in reservoir simulations, notably in [2] and [5], are usually
referred to as combinative two-stage preconditioners (C2S) and we will adopt
the same name.
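
A multiplicative combination of two preconditioners can be applied without ever forming the combined operator: apply the first stage, recompute the residual, and correct with the second stage. The sketch below illustrates this under loudly stated assumptions: `apply_two_stage` is a hypothetical name, the matrix is a small dense random one, and the lower triangle and the diagonal of that matrix are simple stand-ins for the Block Gauss-Seidel and block Jacobi ILU(0) stages.

```python
import numpy as np

def apply_two_stage(A, apply_first, apply_second, r):
    """Return z = M1^{-1} r + M2^{-1} (r - A M1^{-1} r)."""
    z1 = apply_first(r)                 # first stage
    return z1 + apply_second(r - A @ z1)  # second stage acts on the updated residual

# Toy problem (hypothetical data, chosen only to make the sketch runnable).
rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)
M1 = np.tril(A)              # stand-in for the first-stage preconditioner
M2 = np.diag(np.diag(A))     # stand-in for the second-stage preconditioner

apply_first = lambda r: np.linalg.solve(M1, r)
apply_second = lambda r: r / np.diag(A)

r = rng.standard_normal(n)
z = apply_two_stage(A, apply_first, apply_second, r)
print(np.linalg.norm(r - A @ z) / np.linalg.norm(r))
```

With this form the residual propagation operator factorises as $(I - A M_2^{-1})(I - A M_1^{-1})$, which is what "multiplicative" refers to.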

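The recoupling preconditioners $\mathcal{P}_{BGS}$ and $\mathcal{P}_{C2S}$ are used as right
preconditioners for (3). In summary (putting all three steps together) we obtain the
following preconditioned system:

\[
\hat{A}\,\hat{\delta y} := \bigl(\mathcal{D}^{-1} A\,\mathcal{P}^{-1}\bigr)\bigl(\mathcal{P}\,\delta y\bigr) = \mathcal{D}^{-1} f =: \hat{f} \qquad (6)
\]

where $\mathcal{D}^{-1}$ is either $\mathcal{D}_{GE}^{-1}$, $\mathcal{D}_{HH}^{-1}$ or $\mathcal{D}_{QI}^{-1}$, and $\mathcal{P}^{-1}$ is either $\mathcal{P}_{BGS}^{-1}$ or $\mathcal{P}_{C2S}^{-1}$.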

3 Numerical Results 

In this section we discuss the numerical tests that we carried out
to evaluate the efficiency and robustness of the preconditioner. The five test cases
are subsections of sedimentary basins taken from real case studies of major oil
companies. These studies are confidential and so we will refer to the test cases as
Problems A - E. Problems A and D are problems with strong heterogeneities
and anisotropies in the permeability tensors of the porous media but with low
migration ratios, and thus represent almost single-phase flow. Problems B and
C are problems with higher migration ratios, but without any strong
heterogeneities in the permeability tensors. Problem E, on the other hand, comprises
both these difficulties and is thus the hardest test case (not yet available in [7]).
We will only consider a set of time steps towards the end of the simulation, when
most of the z-layers are deposited and the linear systems are the largest. The
number of unknowns ranges from 16974 for Problem A to 129756 for Problem E.
Unfortunately, larger problems of entire sedimentary basins were not available
for reasons of confidentiality.

As the iterative solution method for the preconditioned systems (6) we use
the Bi-CGStab method by Van der Vorst [10] which is particularly suited for
non-symmetric linear systems. The stopping criterion is the reduction of the residual
by a factor of e = 10“®. The CPU-times are those obtained on an SGI Origin2000
using between 1 and 8 processors. We will compare the results obtained with
our preconditioner with the results using block Jacobi preconditioning with an
incomplete LU factorisation ILU(0) of each block, previously the only way to
precondition these linear systems efficiently in parallel.



Table 1. Comparison of the different decoupling/recoupling strategies - Iterations

                               Block Gauss-Seidel    Combinative Two-Stage
Problem        N     P_BILU      D_GE      D_QI         D_GE      D_QI
A          16947       47.8       4.2       4.2          3.5       3.5
B          23703       20.1       9.4      10.2          4.6       3.8
C          69312       21.0      13.0      13.6          2.8       6.0
D          94089      103.8       4.6       4.6          3.8       3.8
E         129756      138.1     111.0      88.7         55.4       8.4


As we can see in Table 1, the performance of ILU(0) deteriorates rather
strongly as the mesh size gets larger even in the sequential case, in particular
for problems with strong heterogeneities (i.e. A, D, and E). Moreover, its block
Jacobi parallelisation provides no global preconditioning at all and will
eventually no longer be sufficient for convergence (when the number of processors
gets too large). Increasing the fill-in for the incomplete LU decomposition does
not cure these problems, and the reduction in the number of iterations is in all
cases outweighed by the extra cost of each iteration.



3.1 Comparison of the Different Decoupling/Recoupling Strategies 

To compare the robustness of the different decoupling and recoupling strategies
which have been proposed in Section 2, we will look at the average number of
Bi-CGStab iterations per Newton iteration in the sequential case. These are given
in Table 1.

Based on the results for Problems A - D, we concluded in [7] that the choice 
of decoupling has no strong influence on the robustness of the preconditioner, but 
that for problems with high migration ratios the block Gauss-Seidel recoupling 
was not sufficient to guarantee the robustness of the preconditioner anymore. 
The combinative two-stage preconditioner, on the other hand, proved to be very 
robust in all cases. 

The new test case E however, which combines both strong heterogeneities
and high migration ratios, gives us some very valuable new insight into the
robustness of our preconditioner. In particular, we can observe that Gauss
decoupling $\mathcal{D}_{GE}$ (as well as Householder decoupling $\mathcal{D}_{HH}$) leads to a much worse
performance than Quasi-IMPES decoupling $\mathcal{D}_{QI}$, i.e. 55.4 iterations for $\mathcal{D}_{GE}$ in
comparison to 8.4 iterations for $\mathcal{D}_{QI}$ using two-stage recoupling. This is probably
due to the loss of ellipticity in the pressure block. The results also underline
(even more dramatically) that block Gauss-Seidel recoupling is not sufficient for
problems with high migration ratios, i.e. 88.7 iterations for block Gauss-Seidel
in comparison to 8.4 iterations for the two-stage recoupling (using $\mathcal{D}_{QI}$). Thus,
we can conclude that in the case of high migration ratios a local decoupling
of the two mass conservation equations does not reduce the global coupling of
the two equations significantly. However, the combinative two-stage
preconditioner provides the necessary recoupling and is very robust in combination with
Quasi-IMPES decoupling $\mathcal{D}_{QI}$ (see the final column in Table 1).

3.2 Comparison of the Pressure Preconditioners 

In order to allow for a parallelisation of AMG, the BoomerAMG implementation
by Henson and Yang [4] differs from the original AMG code AMG1R5 [6] in its
coarsening strategy. (Here we use Falgout coarsening within BoomerAMG.) To
test whether BoomerAMG is competitive as the pressure preconditioner $\mathcal{P}^{p}$ in the
sequential case, we compare in Table 2 the results for Problems D and E using
one V-cycle of BoomerAMG on one processor with the results using one V-cycle
of AMG1R5 [6]. We see that apart from slight variations they perform almost
identically.



Table 2. Comparison of pressure preconditioners P^p: Problems D (left) and E (right)

                          Problem D                            Problem E
                Iterations        CPU-Time           Iterations        CPU-Time
P^p            P_BGS  P_C2S     P_BGS   P_C2S       P_BGS  P_C2S     P_BGS    P_C2S
BoomerAMG        4.4    3.4     6.1 s   6.6 s        76.3    7.8     81.4 s   16.2 s
AMG1R5           4.6    3.8     5.0 s   6.0 s        88.7    8.4     86.9 s   15.7 s
exact LU         3.0    1.8       -       -          58.4    6.8       -        -



We also include (in Table 2) the number of iterations for our preconditioner
when the pressure block is inverted exactly using an LU factorisation of $\tilde{A}^{p}$. The
number of iterations decreases at most by a factor of 2, which is astonishing.



3.3 Comparison with Block Jacobi - Parallel Speedup

Finally and most importantly we look at the parallel robustness and efficiency of
our preconditioner and compare it to block Jacobi preconditioning with ILU(0)
per block, $\mathcal{P}_{BILU}$. The results (with Quasi-IMPES decoupling) for Problems D
and E are given in Tables 3 and 4, respectively.

First of all, observe the extremely good parallel robustness of our precondi- 
tioner for all of the recoupling strategies. The number of iterations hardly grows 
at all with the number of processors. In terms of CPU-time, our preconditioner 

Table 3. Comparison with Block Jacobi: Problem D (N = 94089)

                     Iterations                     CPU-Time
Processors    P_BILU   P_BGS   P_C2S       P_BILU    P_BGS    P_C2S
1              103.8     4.4     3.4       29.4 s    6.1 s    6.6 s
2              103.2     4.2     3.2       16.7 s    4.5 s    4.7 s
4              109.2     4.4     3.4        6.9 s    2.8 s    2.9 s
8              116.6     4.6     3.4        3.0 s    2.0 s    2.3 s


Table 4. Comparison with Block Jacobi: Problem E (N = 129756)

                     Iterations                     CPU-Time
Processors    P_BILU   P_BGS   P_C2S       P_BILU    P_BGS    P_C2S
1              138.1    76.3     7.8       56.3 s   81.4 s   16.2 s
2              147.3    78.3     8.2       35.6 s   56.5 s   11.9 s
4              167.8    76.0     8.0       18.0 s   32.3 s    7.0 s
8              191.5    78.6     8.0        8.2 s   19.8 s    4.4 s





Fig. 1. Parallel Speedup: Problems D (left) and E (right) 



is more than 3 times more efficient than $\mathcal{P}_{BILU}$ on one processor and the parallel
speedup is very good: between 3 and 4 on 8 processors (see Figure 1). This is in
fact about the same speedup as the one reported in [4] for BoomerAMG on
simpler test problems. Finally, it is interesting to note that the modification of
the combinative two-stage recoupling seems to lead to a better efficiency of the
preconditioner both in terms of iterations and CPU-time and that this method
therefore is our method of choice.
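
The speedup figures can be recomputed directly from the CPU-times reported in Tables 3 and 4. The short sketch below does only this arithmetic; the dictionary layout and the function names `speedup` and `efficiency` are illustrative choices, and the timings are copied from the two tables.

```python
# CPU-times in seconds on 1, 2, 4 and 8 processors (from Tables 3 and 4).
processors = [1, 2, 4, 8]
cpu_times = {
    "D": {"BILU": [29.4, 16.7, 6.9, 3.0],
          "BGS": [6.1, 4.5, 2.8, 2.0],
          "C2S": [6.6, 4.7, 2.9, 2.3]},
    "E": {"BILU": [56.3, 35.6, 18.0, 8.2],
          "BGS": [81.4, 56.5, 32.3, 19.8],
          "C2S": [16.2, 11.9, 7.0, 4.4]},
}

def speedup(times):
    """Speedup relative to the one-processor time."""
    return [times[0] / t for t in times]

def efficiency(times, procs):
    """Parallel efficiency: speedup divided by the number of processors."""
    return [s / p for s, p in zip(speedup(times), procs)]

for problem, methods in cpu_times.items():
    for method, times in methods.items():
        s = speedup(times)
        e = efficiency(times, processors)
        print(f"Problem {problem}, {method}: "
              f"speedup on 8 processors = {s[-1]:.2f}, efficiency = {e[-1]:.2f}")
```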

Abstract. The impact of quantum mechanical space-quantization effects
on the operation of a narrow-width SOI device structure has been
investigated. The presence of a two-dimensional carrier confinement 
(both vertical and along the width direction) gives rise to larger average 
displacement of the carriers from the interface and lower sheet electron 
density in the channel region. This, in turn, results not only in a signif- 
icant increase in the threshold voltage but also in pronounced channel 
width dependency of the drain current. In these simulations we have used 
classical 3D Monte Carlo particle-based simulations. Quantum mechan- 
ical space-quantization effects have been accounted for via an effective 
potential scheme that has been quite successful in describing bandgap 
widening effect and charge set back from the interface in conventional 
MOSFET devices. 



1 Introduction 

For quite some time, the dimensions of semiconductor devices have been scaled 
aggressively in order to meet the demands of reduced cost per function on a chip 
used in modern integrated circuits. There are some problems associated with 
device scaling, however. Critical dimensions, such as transistor gate length and 
oxide thickness, are reaching physical limitations. Considering the manufactur- 
ing issues, photolithography becomes difficult as the feature sizes approach the 
wavelength of ultraviolet light. In addition, it is difficult to control the oxide 
thickness when the oxide is made up of just a few monolayers. In addition to the 
processing issues, there are also some fundamental device issues. As the oxide 
thickness becomes very thin, the gate leakage current due to tunneling increases 
drastically. This significantly affects the power requirements of the chip and the 
oxide reliability. Short-channel effects (SCEs), such as drain-induced barrier low- 
ering (DIBL) and the Early effect in bipolar junction transistors (BJTs), degrade 
the device performance. Hot carriers also degrade device reliability. The control 
of the density and location of the dopant atoms in the MOSFET channel and 
source/drain region to provide a high on-off-current ratio becomes a problem in 
the 50 nm node devices where there are no more than 100 impurity atoms in the 
device active region. Yet another issue that becomes prominent in nano-scale 



I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 105-111, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 



106 Shaikh S. Ahmed and Dragica Vasileska 



MOSFET devices is the quantum-mechanical nature of the charge description 
which, in turn, gives rise to inversion layer capacitance comparable to the oxide 
capacitance. This, in turn, degrades the total gate capacitance and, therefore, 
the device transconductance. 

A solution to some of the above-mentioned problems, in order to achieve
enhanced device performance, is to use silicon-on-insulator (SOI) materials. Over
the last three decades, SOI CMOS has been identified as one possible method
for increasing the performance of CMOS over that offered by simple scaling.
However, prior to the 1990s, SOI had not been suitable as a substrate for
mainstream applications. The barriers to its widespread usage were many, the main
ones being SOI material quality, device design, and the steady progress in bulk
CMOS performance through scaling. In view of these problems, IBM launched
the first fully functional SOI mainstream microprocessor in 1999, marking that
SOI technology was becoming the state-of-the-art technology for future
low-power ICs. An SOI SIMOX (separation by implanted oxygen) substrate with
partially depleted epitaxial films (greater than 0.15 micron) was used for this
purpose following a 0.22-micron technology. With this effort, the IBM
specifications predict a 25-35% improvement over the bulk CMOS counterpart, which is
equivalent to about two years of progress in bulk CMOS design and fabrication
process [14].

The above described SOI devices are not without problems, however. The
thin gate oxide gives rise to gate leakage and the quantum-mechanical charge
description due to the vertical confinement leads to device transconductance
degradation. The quantization of the channel in the FD device along the growth
direction gives rise to threshold voltage increase. This is a consequence of the
fact that there is a reduction in the sheet electron density due to the smaller
density of states function of the quasi-two-dimensional electron gas (Q2DEG)
and the vanishing of the wavefunctions at the Si/SiO2 interface because of the
large potential barriers (3.24 eV). In a recent work by Majima et al. [13], it has
been shown that, if the channel width is on the order of 10 nm or so, quantization
occurs in the lateral direction as well. Even though this device structure has a top
oxide thickness of 34 nm, it serves as a proof of concept that
two-dimensional quantization effects can give rise to a significant threshold voltage shift in
nano-scale SOI device structures. It is the purpose of this paper to investigate the
observed threshold voltage shifts in the device structure fabricated by Majima
et al. [13] due to quantum-mechanical space-quantization effects using the
effective potential scheme proposed by Ferry et al. [1]. It is worth noting that in some
previous investigations we have shown that the effective potential approach can
be successfully used for the case of two-dimensional carrier confinement effects
in narrow-width wires, and the simulation results for the line electron
density obtained by using the effective potential approach and the self-consistent
solution of the 2D Schrödinger-Poisson problem can be found in [1]. The
incorporation of the effective potential approach into a Monte Carlo particle-based
simulator allows us, for the first time, to investigate the device transfer and
output characteristics with proper treatment of two-dimensional quantization
effects, velocity overshoot and carrier heating on an equal footing. Previously,
this scheme has been applied to a conventional MOSFET device structure in
which only one-dimensional quantum-mechanical size-quantization effects have
been considered [7].

2 Theoretical Model 

The inclusion of quantum effects in the description of the inversion layer at the
semiconductor/oxide interface of a SOI device involves solving the Schrödinger
equation for the carriers in an approximately rectangular potential well. As a
result, one obtains bound states, which give rise to two major features: reduced 
sheet density and charge set-back. Since the lowest bound state can be regarded 
as the new bottom of the conduction band, the spacing between the Fermi level 
and the conduction band edge is effectively increased, which results in the re- 
duced sheet charge density with respect to the case in which quantum effects 
are excluded. Moreover, the probability density in the lowest bound state now 
has a maximum away from the top semiconductor/oxide interface, resulting in 
charge displacement from the oxide, which accounts for an effective increase in 
the oxide thickness. 

An alternative to solving the Schrödinger wave equation is to use quantum
potentials. The idea of quantum potentials originates from the hydrodynamic
formulation of quantum mechanics, first introduced by de Broglie and Madelung
[4, 5, 12], and later developed by Bohm [2, 3]. In this picture, the wave function is
written in complex form in terms of its amplitude and phase. When substituted
back into the Schrödinger equation one arrives at coupled equations of motion
for the density and phase. The equations arising from this so-called Madelung
transformation to the Schrödinger equation have the form of classical
hydrodynamic equations with the addition of an extra potential, often referred to as the
quantum or Bohm potential. The Bohm potential essentially represents a field
through which the particle interacts with itself. It has been used, for example,
in the study of wave packet tunneling through barriers [6], where the effect of
the quantum potential is shown to lower or smooth barriers, and hence allow
for the particles to leak through.

In analogy to the smoothed potential representation discussed above, Ferry et
al. [7] suggested an effective potential that emerges from a wave packet
description of particle motion, where the extent of the wave packet spread is obtained
from the range of wavevectors in the thermal-distribution function
(characterized by an electron temperature). The effective potential, $V_{eff}$, is related to the
self-consistent Hartree potential, $V$, obtained from the Poisson equation, through
an integral smoothing relation

\[
V_{eff}(x) = \int V(x + y)\, G(y, a_0)\, dy, \qquad (1)
\]

where $G$ is a Gaussian with the standard deviation $a_0$. The effective potential
is then used to calculate the electric field that accelerates the carriers in the
transport kernel of the simulator described below.
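
To illustrate the smoothing effect of such a Gaussian convolution, the sketch below applies it to an abrupt one-dimensional oxide/semiconductor barrier. It is a toy example rather than the simulator's implementation: the function name `effective_potential`, the grid, the truncation of the Gaussian at four standard deviations and the constant extension of the potential beyond the grid are all assumptions made here; the 3.24 eV barrier height and the 0.63 nm smoothing parameter are the values quoted in the text.

```python
import numpy as np

def effective_potential(v, dx, a0):
    """Smooth a 1D potential with a normalised Gaussian of standard deviation a0."""
    half = int(np.ceil(4.0 * a0 / dx))        # truncate the kernel at about 4 sigma
    y = np.arange(-half, half + 1) * dx
    g = np.exp(-y**2 / (2.0 * a0**2))
    g /= g.sum()                               # discrete normalisation
    # Extend the potential by its edge values so the output has the input length.
    v_padded = np.pad(v, half, mode="edge")
    return np.convolve(v_padded, g, mode="valid")

dx = 0.05                                      # grid spacing in nm
x = (np.arange(201) - 100) * dx                # x = 0 is the interface
v = np.where(x < 0.0, 3.24, 0.0)               # oxide barrier (eV) for x < 0
v_eff = effective_potential(v, dx, a0=0.63)

i0 = 100                                       # index of the interface node
print(f"V at interface: {v[i0]:.2f} eV, V_eff at interface: {v_eff[i0]:.2f} eV")
```

The smoothed potential is already nonzero on the semiconductor side of the interface, which is the origin of the repulsive field that sets the carriers back from the oxide.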






The Monte Carlo model used in the transport portion of the
simulator is based on the usual Si band-structure for three-dimensional electrons in
a set of non-parabolic $\Delta$-valleys with energy-dependent effective masses. The six
conduction band valleys are included through three pairs: valley pair 1 pointing
in the (100) direction, valley pair 2 in the (010) direction, and valley pair 3 in the
(001) direction. The explicit inclusion of the longitudinal and transverse masses
is important and this is done in the program using the Herring-Vogt
transformation [9]. Intravalley scattering is limited to acoustic phonons. For the intervalley
scattering, we include both g- and f-phonon processes. At present, impact
ionization, surface-roughness and Coulomb scattering are not included in the model.
They are omitted as they tend to mask the role of the space-quantization effects
on the overall device performance.

In solving Poisson’s equation, the Monte Carlo simulation is used to obtain 
the charge distribution in the device. We use the Incomplete Lower-Upper de- 
composition method for the solution of the 3D Poisson equation. After solving 
for the potential, the electric fields are calculated and used in the free-flight por- 
tion of the Monte Carlo transport kernel. However, the charges obtained from 
the EMC simulation are usually distributed within the continuous mesh cell in- 
stead of on the discrete grid points. The particle mesh method (PM) is used to 
perform the switch between the continuum in a cell and discrete grid points at 
the corners of the cell. The charge assignment to each mesh-point depends on 
the particular scheme that is used. A proper scheme must ensure proper coupling 
between the charged particles and the Coulomb forces acting on the particles. 
Therefore, the charge assignment scheme must maintain zero self-forces and a 
good spatial accuracy of the forces. To achieve this, two major methods have 
been implemented in the present version of the code: the Nearest-Grid-Point 
(NGP) scheme and the Cloud-in-Cell (CIC) scheme. The cloud-in-cell scheme 
(CIC) produces a smoother force interpolation, but introduces self-forces on 
non-uniform meshes. These issues have been dealt with extensively by Hockney 
and Eastwood [10], and quite recently by Laux [11]. 
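
The difference between the two charge assignment schemes is easiest to see in one dimension on a uniform mesh. The following sketch is an illustration under these simplifying assumptions and is not taken from the simulator; the function names `assign_ngp` and `assign_cic` and the particle data are hypothetical.

```python
import numpy as np

def assign_ngp(positions, charges, n_nodes, dx):
    """Nearest-Grid-Point: each particle's charge goes to the closest node."""
    rho = np.zeros(n_nodes)
    idx = np.clip(np.rint(positions / dx).astype(int), 0, n_nodes - 1)
    np.add.at(rho, idx, charges)          # accumulates correctly for repeated indices
    return rho / dx                       # charge per unit length

def assign_cic(positions, charges, n_nodes, dx):
    """Cloud-in-Cell: charge is shared linearly between the two enclosing nodes."""
    rho = np.zeros(n_nodes)
    left = np.clip(np.floor(positions / dx).astype(int), 0, n_nodes - 2)
    w_right = positions / dx - left       # fractional distance from the left node
    np.add.at(rho, left, charges * (1.0 - w_right))
    np.add.at(rho, left + 1, charges * w_right)
    return rho / dx

dx = 1.0
n_nodes = 5
positions = np.array([0.25, 1.6, 2.9])    # particle positions, in units of dx here
charges = np.ones(3)

rho_ngp = assign_ngp(positions, charges, n_nodes, dx)
rho_cic = assign_cic(positions, charges, n_nodes, dx)
print("NGP:", rho_ngp)
print("CIC:", rho_cic)
```

Both schemes conserve the total charge; the linear weighting spreads each particle over two nodes, so the resulting density varies more smoothly as a particle moves across a cell.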

Note that the electric field used in the free-flight portion of the simulator
is calculated using the effective potential approach due to Ferry. Because of
the presence of the semiconductor-oxide barrier, the convolution of the Hartree
potential, obtained from solving the Poisson equation, and a Gaussian function
with smoothing parameter of 0.63 nm along the growth direction and 1 nm along
the channel (from source to drain) gives rise to electric fields that repel the
carriers in the vicinity of the top semiconductor/oxide interface. This gives rise
to larger average displacement of the carriers from the interface and degradation
of the total gate capacitance due to the increase of the effective oxide thickness.

3 Simulation Results 

The device structure we simulate consists of a thick silicon substrate, on top of
which is grown 400 nm of buried oxide. The thickness of the silicon on insulator
layer is 7 nm, with p-region width of 10 nm. The channel length is 50 nm and the
doping of the p-layer is 10^® cm“^. On top of the SOI layer sits a gate-oxide layer,
the thickness of which is 34 nm. This is a rather thick gate oxide, but we have
used it to compare our simulation results with the experimental data of Majima
et al. [13]. The doping of the source/drain junctions equals 10^® cm“^, and the
gate is assumed to be a metal gate with workfunction equal to the semiconductor
affinity.




Fig. 1. Width dependence of the threshold voltage for $V_D$ = 0.1 V (horizontal axis: channel width [nm]).



The use of the effective potential shifts the conduction band edge upwards
along the width and the depth of the device. It, therefore, accounts for the
so-called band-gap widening effect, which leads to a reduction of the carrier density
at the interface. Also, the electrons are moved away from the interface because of
the additional perpendicular electric field in the vicinity of the Si-SiO2 interface.

The quantization of charge in the inversion layer produces an expected
increase in the threshold voltage of the device. The increase in the threshold voltage
in the structure being investigated is due to both lateral and vertical
confinement. The width dependence of the threshold voltage is shown in Figure 1. The
sharp decrease in the threshold voltage with increasing channel width can be
explained by an increase in the channel conductance due to less quantized carriers.
Note that similar trends have been observed in the experimental device structure
from [13]. The slight discrepancy between the theory and the experiment can be
due to two factors. First, the experimental device is grown on a (110) substrate
whereas the simulation is performed for a (100) substrate. Second, the threshold voltage
in [13] is extracted using the constant current technique, whereas we have used
the normalized mutual integral difference method due to He and coworkers [8].
The normalized mutual integral difference method utilizes the exponential-linear
characteristic of the drain current versus gate voltage to obtain the threshold voltage.
As such, it is much better than any of the conventionally used methods
including the constant current technique, the linear extrapolation technique, the second
derivative technique, the quasi-constant current technique, etc.

4 Conclusions

In this work, we have utilized the effective potential approach to successfully 
simulate two dimensional space quantization effects in a model of a narrow- 
channel SOI device structure. The effective potential provides a set back of the 
charge from the interface, and a quantization of the energy within the channel. 
Both of these effects lead to an increase in the threshold voltage. We find a 
threshold voltage shift of about 250mV when the effective potential is included 
in the model for a device with lOnm channel width. Note that all simulations 
have been performed on a dual processor PC using Fortran 90. Also note that the 
scheme proposed by Ferry for incorporation of the quantum-mechanical space- 
quantization effects leads to only 10% increase of the overall CPU time. The 
solution of the 2D Schrödinger equation along slices of the device would be 
much more computationally costly. 
Abstract. The problem of tracking a reentry ballistic object by processing 
radar measurements is considered in the paper. A sequential Monte 
Carlo-based filter is proposed for dealing with the high nonlinearity of the 
object dynamics. A multiple model configuration is incorporated into 
the algorithm for overcoming the uncertainty about the object ballistic 
characteristics. The performance of the suggested multiple model particle 
filter (PF) is evaluated by Monte Carlo simulation. 



1 Introduction 

The flight of a ballistic object (BO) consists of three phases: boost, ballistic (free 
flight) and terminal (reentry) phase. The topic of investigation in the present 
paper is the object reentry trajectory, when the Earth’s atmosphere is reached 
and the atmospheric drag becomes considerable. This phase finishes with the 
landing point on Earth [11]. The precise estimation of BO motion parameters and 
the accurate prediction of its impact point are of great importance for defense 
and for safety against reentry of old satellites and falling debris [6]. 

There are many extensive studies of BO tracking by processing radar measurements 
or using electro-optical sensors. The challenges that must be overcome 
are connected primarily with high nonlinearity of the dynamic system, formed 
by BO and sensor. In addition, difficulties arise due to the uncertainty of the 
ballistic object parameters and noisy observations, received by the sensor. 

There are two major approaches to nonlinear recursive estimation of the 
object dynamics [13]. The nonlinear state and measurement equations in the 
first one are approximated by Taylor series expansions. The linearized (about the 
current state estimate) functions are applied directly to the linear Kalman filter 
to produce the extended Kalman filter (EKF). Various algorithms based on the EKF for 
the reentry phase are proposed in the literature, resulting in different trade-offs 
between accuracy and computational costs. However, when the filtering problem 
is highly nonlinear, the local linearity assumption breaks down and the EKF 
may introduce large estimation errors. The second approach is based on the 
approximation of the underlying density functions of the state vector. Recursive 
algorithms on the densities can be derived from Bayes' formula. These density-based 
algorithms are preferable in the cases of both nonlinear equations and 
non-Gaussian error terms. The unscented Kalman filter and PF belong to this 
class. The variants of these filters are suggested in [6] for the purposes of reentry 
vehicle tracking. 

BO uncertainty during the reentry phase is concentrated on the drag charac- 
teristics [11]. In general, the ballistic profile of the foreign vehicle under surveil- 
lance is not known and must be estimated for precise tracking. A natural tech- 
nique is to include drag (or another suitable quantity) into the state vector as 
a state to be evaluated [3,4,7]. The authors of [12] adopt an uncertain interval 
for a key system parameter (the atmospheric density gradient). Then the entire 
tracking system is modeled as an interval nonlinear system and analyzed via an 
extended interval Kalman filter. 

A Monte Carlo filter is proposed in this work for BO radar tracking. Since 
BO dynamics are governed by a stochastic differential equation (SDE), the time 
evolution of the BO state probability density follows a Fokker-Planck equation 
(FPE). A numerical solution of the SDE is realized in order to approximate the 
probability density of the predicted state. The objective is to refine the pro- 
posal distribution and consequently to improve the overall estimation accuracy. 
A multiple model approach is proposed also for identifying the unknown drag 
parameter (DP). It is assumed that the ballistic coefficient (the inverse of DP) 
belongs to a discrete set of possible values. 

The rest of the paper is organized as follows. The problem under consider- 
ation is set up in Section 2. Section 3 presents the features of BO SDE and its 
approximate solution. The suggested PF is described in Section 4. Computer 
simulation results are shown in Section 5, with conclusions given in Section 6. 



2 Problem Formulation 

BO and Sensor Models. The starting point for modeling the time evolution of the BO 
state $x_t$ is the Ito SDE [8, 9] 

$$dx_t = f(x_t, t)\,dt + g(x_t, t)\,dw_t \qquad (1)$$

where $x_t$ is an n-dimensional vector, f is an n-dimensional vector-valued function, 
g is an n × p matrix with $Q = g(x, t)\,g(x, t)'$ the system noise covariance matrix, 
and $w_t$ is a p-dimensional Wiener process. 

The function of the tracking filter is to estimate $x_t$ at time $t_k$ ($x_{t_k} = x_k$), 
given a set of measurements $z_{1:k} = \{z_i,\ i = 1, \dots, k\}$, made at discrete times 
$t_1, t_2, \dots, t_k$. 
The measurement model is given by the equation 

$$z_k = h(x_k, t_k) + v_k, \quad k = 1, 2, \dots \qquad (2)$$

where $z_k$ is an m-dimensional measurement vector, h is a measurement function 
and $v_k$ is a Gaussian white noise sequence with zero mean and covariance R. 

Density-based Filtering. The objective of density-based filtering methods is to 
construct the state posterior probability density function (PDF) $p(x, t_k | z_{1:k})$ 
based on all available information. According to the Bayesian philosophy, 
$p(x, t_k | z_{1:k})$ may be obtained recursively within the time interval $[t_{k-1}, t_k]$ in 
two stages - prediction and update [5]: 

Prediction. The PDF evolution $p(x, t_{k-1} | z_{1:k-1}) \to p(x, t_k | z_{1:k-1})$ is connected 
directly with the Ito equation (1). It is given by the FPE [8] 

$$\frac{\partial p}{\partial t} = -\nabla_x \cdot \big(f(x, t)\,p\big) + \sum_{i,j} \frac{\partial^2}{\partial x_i\,\partial x_j} \big[(Q/2)_{ij}\,p\big] \qquad (3)$$

where the initial condition $p(x, t_{k-1} | z_{1:k-1})$ is available from the previous time 
step $t_{k-1}$. It is assumed that the initial PDF $p(x, t_0 | z_0) = p(x_0)$ is known. 

Information Update. When the measurement $z_k$ arrives, the update stage 
$p(x, t_k | z_{1:k-1}) \to p(x, t_k | z_{1:k})$ is realized: 

$$p(x, t_k | z_{1:k}) = \frac{p(z_k | x)\,p(x, t_k | z_{1:k-1})}{\int p(z_k | x)\,p(x, t_k | z_{1:k-1})\,dx} \qquad (4)$$



The normalizing constant $p(z_k | z_{1:k-1})$ in the denominator depends on the likelihood 
function $p(z_k | x)$, determined by the parameters of the measurement model 
(2). The measurements are assumed conditionally independent in time. 

All information about the state $x_{t_k}$, contained in the measurements $z_{1:k}$, is 
captured in the conditional posterior PDF (4). An optimal (with respect to any 
criterion) estimate of the state and a measure of its accuracy can be obtained 
from $p(x, t_k | z_{1:k})$ [1,5]. In particular, the mean $\hat{x}_k = \int x\,p(x, t_k | z_{1:k})\,dx$ can be 
taken as the BO state estimate, solving in this way the filtering problem. 

The recursive propagation (3) and (4) of the PDF is only a conceptual solution 
and can be implemented exactly only for a restrictive set of cases [1]. 
Simulation-based methods, by contrast, offer efficient approximations to this 
optimal Bayesian solution. 

Multiple Model (MM) Approach. One of the problems connected with BO tracking 
is that the unknown DP makes the state transition function $f(x_t, t)$ in 
equation (1) uncertain. However, in practice the drag values 
lie in an a priori known fixed interval. The MM approach is a widely 
used methodology for overcoming this problem. It is assumed that the object 
state belongs to one of r models $M_j$, j = 1, ..., r, corresponding to one of r 
discrete values in the drag interval. The prior probability that $M_j$ is correct, 
$P\{M_j | z_0\} = P_j(0)$, is assumed known and $\sum_{j=1}^{r} P_j(0) = 1$. 
r independent model-matched filters are run in parallel to produce model-conditioned 
state estimates $\hat{x}_k^j$ (j = 1, ..., r). Given the measurement data $z_{1:k}$, the 
posterior probability of model j being correct can be calculated using Bayes' 
formula [2]: 

$$P_j(k) = P\{M_j | z_{1:k}\} = \frac{p(z_k | z_{1:k-1}, M_j)\,P_j(k-1)}{p(z_k | z_{1:k-1})} \qquad (5)$$

The output estimate $\hat{x}_k$ is the weighted sum of the individual estimates (combined 
estimate) or the output from the “best” filter (with maximum posterior 
probability). 
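The recursion for the model probabilities and the combined estimate can be illustrated with a minimal Python sketch. The function name `update_model_probabilities` and the numerical values are hypothetical; the model-conditioned likelihoods are assumed to be supplied by the individual filters.

```python
def update_model_probabilities(prior, likelihoods):
    """Bayes update: the posterior of model j is proportional to likelihood_j * prior_j."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)  # plays the role of the normalizing constant
    if total == 0.0:
        raise ValueError("all model likelihoods are zero")
    return [v / total for v in joint]


# Two models with equal prior probabilities (illustrative numbers only).
posterior = update_model_probabilities([0.5, 0.5], [0.2, 0.6])

# Combined estimate: model-conditioned estimates weighted by the posteriors.
model_estimates = [10.0, 14.0]
combined = sum(w * x for w, x in zip(posterior, model_estimates))
```

Selecting the “best” filter instead amounts to taking the estimate of the model with the largest posterior probability.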



3 Numerical Solution of BO SDE 



During the reentry phase, two significant forces act on the object [11]: the Earth’s 
gravity and the atmospheric drag. The drag force acts opposite to the velocity 
vector $\mathbf{v}_r$, with an intensity proportional to the air density $\rho$ and the square of 
the object speed $v_r$. The drag-induced acceleration is given by [11] 

$$a_D = -\frac{1}{2}\,\alpha\,\rho(h)\,v_r^2\,u_v \quad \text{for} \quad u_v = \frac{\mathbf{v}_r}{v_r} \qquad (6)$$

where h denotes the object altitude, and $\alpha$ is the drag parameter, depending on 
object mass, body shape, cross-sectional object area perpendicular to the velocity, 
Mach number, drag coefficient and other unknown parameters. Therefore, 
for the reentry tracking, all uncertainties associated with the drag are collected 
in the DP $\alpha$ [11]. 

The air density $\rho$ is usually approximated as a locally exponential function 
of the object altitude, $\rho(h) = \rho_0 \exp(-kh)$, where $\rho_0$ and k are known constants 
for a particular layer of the atmosphere. The following constant values for $\rho_0$ 
and k are accepted in the simulation below: $\rho_0 = 1.227$, $k = 1.0931 \cdot 10^{-4}$ for 
h < 9144 m, according to [6]. 
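As an illustration, the drag-induced acceleration can be evaluated as in the following Python sketch. The function names are hypothetical, the exponent of k is read here as −4 (an assumption where the source is hard to read), and the low-altitude constants are applied at every altitude for simplicity.

```python
import math

RHO0 = 1.227      # constant quoted in the text for h < 9144 m
K = 1.0931e-4     # exponent -4 is an assumption


def air_density(h):
    """Locally exponential density model rho(h) = rho0 * exp(-k * h)."""
    return RHO0 * math.exp(-K * h)


def drag_acceleration(vx, vy, h, alpha):
    """Drag acceleration -0.5 * alpha * rho(h) * |v| * v, opposite to the velocity."""
    speed = math.hypot(vx, vy)
    factor = -0.5 * alpha * air_density(h) * speed
    return factor * vx, factor * vy


# Object moving in the negative x direction at sea level (illustrative values).
ax, ay = drag_acceleration(-100.0, 0.0, 0.0, 1e-3)
```

The result points in the positive x direction, that is, against the motion, as expected for a drag force.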

Assuming a flat, nonrotating Earth model, the BO motion model has the following 
state-space form in the sensor East-North-Up coordinate system: 

$$\ddot{x} = -\tfrac{1}{2}\,\alpha\,\rho(y)\,\sqrt{\dot{x}^2 + \dot{y}^2}\;\dot{x} + w_x, \qquad \ddot{y} = -\tfrac{1}{2}\,\alpha\,\rho(y)\,\sqrt{\dot{x}^2 + \dot{y}^2}\;\dot{y} - g + w_y \qquad (7)$$

where g is the gravity acceleration and q is the intensity of the process noise $w = [w_x\ w_y]'$. 

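In the terminology of SDE (1), the state vector $x = [x\ \dot{x}\ y\ \dot{y}]'$ includes 
BO positions and velocities in the radar-centered Cartesian coordinate system; 
f(x) and g(x) have the following form: 

$$f(x) = \begin{bmatrix} \dot{x} \\ -\frac{1}{2}\,\alpha\,\rho_0 \exp(-ky)\,\sqrt{\dot{x}^2 + \dot{y}^2}\;\dot{x} \\ \dot{y} \\ -g - \frac{1}{2}\,\alpha\,\rho_0 \exp(-ky)\,\sqrt{\dot{x}^2 + \dot{y}^2}\;\dot{y} \end{bmatrix}, \qquad g(x) = \begin{bmatrix} 0 & 0 \\ \sqrt{q} & 0 \\ 0 & 0 \\ 0 & \sqrt{q} \end{bmatrix} \qquad (8)$$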
The Euler approximation is used to generate approximate sample paths of a solution 
of the BO SDE. For a given equidistant discretization of the time interval 
$[t_{k-1}, t_k]$ with step $\Delta = (t_k - t_{k-1})/L$, the Euler approximation realizes the following 
iterative scheme [10]: 

$$Y_0 = x_{k-1} \qquad (9)$$

$$Y_l = Y_{l-1} + \Delta\,f(Y_{l-1}) + g\,\Delta W_{l-1}, \quad l = 1, \dots, L \qquad (10)$$

$$x_k = Y_L \qquad (11)$$

where the components of $\Delta W_l$ are $N(0; \Delta)$ distributed increments of the standard 
Wiener process, and L is some a priori chosen integer. 
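A minimal Python sketch of this Euler scheme for the BO model is given below. It assumes the state ordering (x, ẋ, y, ẏ) with the altitude equal to y, noise entering only the two velocity components, and the density constants read as ρ0 = 1.227 and k = 1.0931·10⁻⁴. The function names and the drag parameter used in the usage line are hypothetical.

```python
import math
import random


def drift(state, alpha, g=9.81, rho0=1.227, k=1.0931e-4):
    """Drift f(x) for the state (x, vx, y, vy); the altitude is taken to be y."""
    x, vx, y, vy = state
    drag = -0.5 * alpha * rho0 * math.exp(-k * y) * math.hypot(vx, vy)
    return (vx, drag * vx, vy, -g + drag * vy)


def euler_predict(state, alpha, T=1.0, L=10, qx=0.09, qy=0.01, rng=random):
    """Propagate one sample path over an interval of length T with L Euler steps."""
    delta = T / L
    y = tuple(state)
    for _ in range(L):
        f = drift(y, alpha)
        # Wiener increments are N(0, delta) distributed.
        dwx = rng.gauss(0.0, math.sqrt(delta))
        dwy = rng.gauss(0.0, math.sqrt(delta))
        noise = (0.0, math.sqrt(qx) * dwx, 0.0, math.sqrt(qy) * dwy)
        y = tuple(yi + delta * fi + ni for yi, fi, ni in zip(y, f, noise))
    return y


# One predicted sample for a hypothetical drag parameter alpha = 2.5e-5.
sample = euler_predict((232e3, -2255.2, 88e3, -397.65), 2.5e-5,
                       rng=random.Random(0))
```

In a particle filter, `euler_predict` would be called once per particle, so that the cloud of propagated samples approximates the predicted density.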



4 Simulation-Based Suboptimal MM Filter 



The key idea of Monte Carlo methods is to represent the required posterior 
density functions by a set of random samples with associated weights and to 
compute estimates based on these samples and weights [5]. Accordingly, the 
evolving PDFs in eqs. (3) and (4) are replaced by a set of random samples 
$\{x_{k-1}^{(i)};\ i = 1, \dots, N\}$, which are predicted, $\{x_k^{*(i)};\ i = 1, \dots, N\}$, and updated, 
$\{x_k^{(i)};\ i = 1, \dots, N\}$, by the particle algorithm for k = 1, 2, .... A bank of r 
independent PFs is run in parallel, 

$$\{(x_k^j)^{(i)},\ (w_k^j)^{(i)};\ i = 1, \dots, N_j\}, \quad j = 1, \dots, r,$$

where $N = \sum_{j=1}^{r} N_j$ is the common number of particles. The weights of the 
particular filters are used to obtain model-conditioned estimates. The weights of 
all N particles take part in the calculation of the output combined estimate and 
posterior model probabilities. The algorithm is described as follows: 



1. Initialization. k = 0. 

   * For i = 1, ..., N, sample $x_0^{(i)} \sim p(x_0)$ and set k = 1. 

2. For j = 1, ..., r (possibly in parallel) execute 

   * Prediction step 
     for i = 1, ..., $N_j$ generate samples $(x_k^j)^{*(i)} \sim p(x, t_k | z_{1:k-1})$ 
     according to the Euler scheme. 

   * Measurement processing step 
     on receipt of a new measurement $z_k$: for i = 1, ..., $N_j$ 
     evaluate the normalized weights 
     $(w_k^j)^{(i)} = p(z_k | (x_k^j)^{*(i)}) \,/\, \sum_{i=1}^{N_j} p(z_k | (x_k^j)^{*(i)})$ 

   * Selection step 
     resample with replacement $N_j$ particles $((x_k^j)^{(i)};\ i = 1, \dots, N_j)$ 
     from the set $((x_k^j)^{*(i)};\ i = 1, \dots, N_j)$ according to the importance weights 

   * Compute model-conditioned estimates 
     $\hat{x}_k^j = \frac{1}{N_j} \sum_{i=1}^{N_j} (x_k^j)^{(i)}$ 

   End for j 

3. Output: Compute combined output estimate and model posterior probabilities. 

4. Set k ← k + 1 and go to step 2. 



The suboptimality of the procedure is due to the finite number of samples used. 
As the sample size tends to infinity, the optimal results can be achieved. 
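The structure of the procedure can be sketched in Python on a deliberately simple toy problem. This is not the BO model: the state is scalar, the two candidate models differ only in a constant drift, and all numerical values are hypothetical. The model probabilities are updated with the mean unnormalized weight of each filter as an estimate of the model likelihood, which is one common choice and an assumption here.

```python
import math
import random


def likelihood(z, x, sigma):
    """Unnormalized Gaussian likelihood p(z | x) of a direct state measurement."""
    return math.exp(-0.5 * ((z - x) / sigma) ** 2)


def mm_pf_step(banks, probs, z, drifts, q, sigma, rng):
    """One cycle (prediction, weighting, selection) for a bank of r particle filters."""
    new_banks, estimates, evidences = [], [], []
    for particles, drift in zip(banks, drifts):
        # Prediction: propagate every particle through the model-j dynamics.
        predicted = [x + drift + rng.gauss(0.0, q) for x in particles]
        # Measurement processing: importance weights from the likelihood.
        weights = [likelihood(z, x, sigma) for x in predicted]
        total = sum(weights)
        evidences.append(total / len(weights))
        # Selection: resample with replacement according to the weights.
        if total > 0.0:
            predicted = rng.choices(predicted, weights=weights, k=len(predicted))
        new_banks.append(predicted)
        # Model-conditioned estimate: mean of the resampled particles.
        estimates.append(sum(predicted) / len(predicted))
    # Posterior model probabilities and combined output estimate.
    joint = [p * e for p, e in zip(probs, evidences)]
    norm = sum(joint)
    if norm > 0.0:
        probs = [v / norm for v in joint]
    combined = sum(p * e for p, e in zip(probs, estimates))
    return new_banks, probs, estimates, combined


def run_demo(steps=10, n_particles=500, seed=1):
    rng = random.Random(seed)
    drifts = [1.0, 3.0]   # hypothetical model parameters; model 0 generates the data
    q, sigma = 0.2, 0.5   # hypothetical process and measurement noise deviations
    banks = [[rng.gauss(0.0, 1.0) for _ in range(n_particles)] for _ in drifts]
    probs, truth, combined = [0.5, 0.5], 0.0, 0.0
    for _ in range(steps):
        truth += drifts[0]
        z = truth + rng.gauss(0.0, sigma)
        banks, probs, _, combined = mm_pf_step(banks, probs, z, drifts, q, sigma, rng)
    return probs, combined, truth


probs, combined, truth = run_demo()
```

After a few measurements the probability of the data-generating model dominates and the combined estimate follows the estimate of that filter.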



5 Simulation Results 



BO scenario. BO trajectories for two drag parameters, $\alpha_1$ = 1/50 • 10^ and 
$\alpha_2$ = 1/1 • 10^, are shown in Fig. 1. The true initial parameters $x_0$ = 280 km, 
$y_0$ = 88 km, $\dot{x}_0$ = −2255.2 m/s, $\dot{y}_0$ = −397.65 m/s, g = 9.81 m/s² are chosen 
as in [6]. 

Measurement model. Tracking radar usually provides measurements in a spherical 
coordinate system. The measurement vector $z_k = [r\ \ \varepsilon]'$, comprising range 
and elevation measurements, has a nonlinear relationship with the object state 
variables. The form of the measurement function in eq. (2) is as follows: 
$h(x) = [\sqrt{x^2 + y^2}\ \ \tan^{-1}(y/x)]'$. The error standard deviations of range and elevation 
angle measurements are $\sigma_r$ = 100 m and $\sigma_\varepsilon$ = 0.15 respectively. A measurement 
conversion is implemented from spherical (r, ε) to Cartesian (x, y) 
coordinates: $z_k^c = [r\cos(\varepsilon)\ \ r\sin(\varepsilon)]'$. Thus the measurement function becomes 
linear. The components of the corresponding covariance matrix R are as follows 
[2]: 

$$\sigma_x^2 = \sigma_r^2 \cos^2(\varepsilon) + r^2 \sigma_\varepsilon^2 \sin^2(\varepsilon);$$

$$\sigma_y^2 = \sigma_r^2 \sin^2(\varepsilon) + r^2 \sigma_\varepsilon^2 \cos^2(\varepsilon);$$

$$\sigma_{xy} = (\sigma_r^2 - r^2 \sigma_\varepsilon^2) \sin(\varepsilon) \cos(\varepsilon).$$

The time interval between measurements is T = 1 s. 
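The measurement conversion and the covariance components of the converted measurement can be sketched in Python as follows. The function name `convert_measurement` is hypothetical, and the angle and its standard deviation are assumed to be given in radians.

```python
import math


def convert_measurement(r, eps, sigma_r, sigma_eps):
    """Convert (range, elevation) to Cartesian (x, y) and build the 2x2 covariance R."""
    c, s = math.cos(eps), math.sin(eps)
    z = (r * c, r * s)
    var_x = sigma_r ** 2 * c ** 2 + r ** 2 * sigma_eps ** 2 * s ** 2
    var_y = sigma_r ** 2 * s ** 2 + r ** 2 * sigma_eps ** 2 * c ** 2
    cov_xy = (sigma_r ** 2 - r ** 2 * sigma_eps ** 2) * s * c
    return z, ((var_x, cov_xy), (cov_xy, var_y))


# Illustrative values: a target at 1 km range on the horizon.
z_c, R = convert_measurement(1000.0, 0.0, 100.0, 0.01)
```

At zero elevation the range error maps entirely onto x and the angular error onto y, so the cross-covariance vanishes.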

Parameters of MM scheme. A bank of r = 2 filters is implemented for DPs 
$\alpha_1$ = 1/50 • 10^ and $\alpha_2$ = 1/1 • 10^, identical with the scenario parameters. The 
prior probability of each model being correct is selected as $P_{1,2}(0)$ = 0.5. An 
equal number of particles, $N_1 = N_2 = 5000$, is chosen for each filter. 

Filter design parameters. The parameters of the state vector initial distribution 
$x_0 \sim N[x_0; m_0, P_0]$ are selected as follows: $m_0$ contains the exact initial BO 
parameters and $P_0$ = diag{200² m, 30.0² m/s, 200² m, 30.0² m/s}. The following 
parameters of the Euler scheme are accepted: prediction integration step is Δ = 
1; L = 10; process noise variance: $q_x$ = 0.09 and $q_y$ = 0.01. The sample size is 
N = 10000. 



Measures of performance. The state estimation accuracy is evaluated in terms of 
mean absolute error (MAE) - the absolute value of the actual error, averaged over 
runs. Position MAE (both coordinates combined) and speed (magnitude of the 
velocity vector) MAE are calculated in the simulation experiments. The average 
probability of correct model identification is a measure for correct recognition of 
the DP. 





Fig. 1. BO Scenario: Trajectory (a) and (b) Speed versus time 





Fig. 2. Position MAE (a) and (b) Speed MAE versus time 



Simulation results. As can be seen from Fig. 1, trajectories with different drag 
parameters are close to each other over a rather long time interval. The differences 
become obvious after the 50th scan, when the speeds rapidly diverge from 
each other. This fact complicates the task of tracking. 

The tracking filter performance is evaluated based on a trajectory with $\alpha = \alpha_1$. 
The results are obtained by averaging over 50 Monte Carlo runs. Position 
and speed MAEs of the combined estimate and the “best” filter (the filter whose 
parameter is tuned to $\alpha = \alpha_1$) estimate are shown in Fig. 2. The “best” filter 
estimate is better during the transitional time interval of identifying the DP (between 
the 50th and 70th scans), which is due to the fairly large posterior probability of 
the “wrong” filter. The averaged (over the runs) error standard deviation of 
the measurements, $\sqrt{\sigma_x^2 + \sigma_y^2}$ (also shown in Fig. 2(a)), can serve as a 
quality measure of the accuracy achieved by the filter. The curve of the position 
MAE is below it in the whole tracking interval. Plots of the posterior model 
probabilities are given in Fig. 3(a). As can be seen, the delay in correct model 
detection is no more than 15 or 20 scans. The DP identification is accomplished 
with a high degree of confidence. PDF histograms $p(x_{70}^j | z_{1:70})$ for j = 1,2 (x-position) 
are presented in Fig. 3(b). The posterior PDF of the appropriate model 
($\alpha = \alpha_1$) is located above the true object position, while the PDF of the “wrong” 
model is far from the desired place, determined mainly from the solution of the 
SDE with $\alpha = \alpha_2$. 

Fig. 3. Posterior model probabilities (a) and (b) PDF histograms $p(x_{70}^j | z_{1:70})$, j = 1,2 

6 Conclusion 

A Monte Carlo-based filter is suggested in the paper for tracking a reentry ballistic 
object by a radar. The objective is to explore the capability of particle algo- 
rithms for precise and reliable estimation of such highly nonlinear systems. 

The feature of the proposed version is connected with the SDE, describing object 
dynamics. A numerical solution of SDE is realized in order to refine the prob- 
ability density of the predicted state and consequently to improve the overall 
estimation accuracy. MM approach is applied also for identifying the unknown 
drag parameter. 

The filter performance is examined by simulation over a typical BO scenario. 
The results show a reliable tracking and correct drag parameter determination. 
It gives an alternative solution to the difficult and important problem of BO 
tracking. 



120 Donka Angelova, Iliyana Simeonova, and Tzvetan Semerdjiev 



References 

1. Arulampalam, S., Maskell, S., Gordon, N., Clapp, T.: A Tutorial on Particle Filters 
for Online Nonlinear/Non-Gaussian Bayesian Tracking. IEEE Trans. Signal 
Processing, Vol. 50 (2002) 174-188. 

2. Bar-Shalom, Y., Li, X.R.: Multitarget-Multisensor Tracking: Principles and Techniques. 
YBS Publishing (1995). 

3. Cardillo, G., Mrstik, A., Plambeck, T.: A Track Filter for Reentry Objects with 
Uncertain Drag. IEEE Trans. AES, Vol. 35 (1999) 394-409. 

4. Costa, P.: Adaptive Model Architecture and Extended Kalman-Bucy Filters. IEEE 
Trans. AES, Vol. 30 (1994) 525-533. 

5. Doucet, A., de Freitas, N., Gordon, N. (eds.): Sequential Monte Carlo Methods in 
Practice. Statistics for Engineering and Information Science. Springer-Verlag, New 
York (2001). 

6. Farina, A., Ristic, B., Benvenuti, D.: Tracking a Ballistic Target: Comparison of 
Several Nonlinear Filters. IEEE Trans. AES, Vol. 38 (2002) 854-867. 

7. Farina, A., Benvenuti, D., Ristic, B.: Estimation accuracy of a landing point of a 
ballistic target. Proc. FUSION 2002 Conf., Annapolis, USA (2002) 2-9. 

8. Jilkov, V., Semerdjiev, Tz., Angelova, D.: Monte Carlo Filtering Algorithm for Nonlinear 
Stochastic Environmental Model. Recent Advances in Numerical Methods 
and Applications II. World Scientific, London (1998) 266-274. 

9. Kastella, K.: Finite Difference Methods for Nonlinear Filtering and Automatic 
Target Recognition. In: Bar-Shalom, Y., Blair, W. (eds.): Multitarget-Multisensor 
Tracking. Applications and Advances, Vol. 3. Artech House, Boston (2000) 233-258. 

10. Kloeden, P., Platen, E., Schurz, H.: Numerical Solution of SDE Through Computer 
Experiments. Springer-Verlag, New York (1994). 

11. Li, X., Jilkov, V.: A Survey of Maneuvering Target Tracking - Part II: Ballistic 
Target Models. Proc. SPIE Conf. Signal and Data Processing of Small Targets, 
Vol. 4473-63, San Diego, CA, USA (2001). 

12. Siouris, G., Chen, G., Wang, J.: Tracking an Incoming Ballistic Missile Using an 
Extended Interval Kalman Filter. IEEE Trans. AES, Vol. 33 (1997) 232-240. 

13. Tanizaki, H.: Nonlinear Filters. Estimation and Applications. Springer-Verlag 
(1996). 



Efficient CPU-Specific Algorithm 
for Generating the Generalized Faure Sequences* 



Emanouil I. Atanassov 



Central Laboratory for Parallel Processing, Bulgarian Academy of Sciences 
Acad. G. Bontchev, Bl. 25A, 1113 Sofia, Bulgaria 
emanouil@parallel.bas.bg 



Abstract. The Faure sequences are a popular class of low-discrepancy 
sequences. Their generalized variants, with better equi-distribution 
properties, are extensively used in quasi-Monte Carlo methods, espe- 
cially for very high dimensional problems. The task of generating these 
sequences can take a substantial part of the overall CPU time of a quasi-Monte 
Carlo computation. 

We present an efficient algorithm for generating these sequences, and 
demonstrate how it may be tuned to use the extended instruction sets, 
available on many modern CPUs, to reduce drastically the CPU-time, 
spent for generating these sequences. 

1 Introduction 

1.1 Ideas of Monte Carlo and Quasi-Monte Carlo Methods 

The main idea in the Monte Carlo methods is to represent the unknown quantity, 
$\eta$, as the mathematical expectation of some random variable $\xi$, so that by 
averaging over N independent samples of $\xi$ one can obtain an estimate of $\eta$. Usually 
the random variable $\xi$ is expressed as a function f of s independent random 
variables, uniformly distributed in [0, 1]. Thus the function f is defined in the 
s-dimensional unit cube $E^s$, and the Monte Carlo algorithm for approximating 
$\eta$ can be thought of as an algorithm for approximate calculation of the integral 

$$\int_{E^s} f(x)\,dx$$

by means of the formula: 

$$\int_{E^s} f(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i),$$

where the values $x_i$ are computed using a pseudo-random number generator. 
The integer s becomes the constructive dimension of the Monte Carlo algorithm 
(see, e.g., [10], p. 254). Such an algorithm has an order of convergence of the 
probable error $O(N^{-1/2})$. In order to obtain faster convergence for suitable classes 
of functions, the idea of the quasi-Monte Carlo methods is introduced. Instead 
of random numbers, one can use a deterministic s-dimensional uniformly distributed 
sequence $\sigma = (x_j)_{j=0}^{\infty}$, so that the average is computed using the values 
of f at the first N terms of the sequence $\sigma$. The quality of distribution of the sequence 
$\sigma$ is usually measured by its discrepancy $D_N(\sigma)$ (for a definition see, e.g., 
[5]). It is believed that the best order of the discrepancy of an infinite sequence 
is $O(N^{-1} \log^s N)$ (see the results of W.M. Schmidt [9], K.F. Roth [8], Baker [1]), 
which yields an asymptotic order of the integration error better than that of 
Monte Carlo. Infinite sequences which achieve such an order of convergence to zero 
of the discrepancy are called “low-discrepancy” sequences. Various theoretical 
and practical results show that even for very large s the low-discrepancy sequences 
may outperform Monte Carlo on classes of functions important for the 
applications (see, e.g., the work of Paskov and Traub [7], and of A. Papageorgiou 
and J. Traub [6]). The question of finding the best low-discrepancy sequence remains 
unsolved. Many classes of low-discrepancy sequences are known, and for 
different problems different classes perform better. One important class of low-discrepancy 
sequences are the sequences introduced by Faure. The generalized 
Faure sequences, proposed by Tezuka [11-13] and sometimes called GFaure, are 
widely used in practice. The difference from the point of view of generation algorithms 
between the general and the particular case is small, and we decided to 
deal only with the general case. 
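The basic Monte Carlo estimate of an integral over the unit cube can be written in a few lines of Python. The following sketch uses the standard pseudo-random generator; the function name `mc_integrate` and the test integrand are hypothetical choices made for illustration.

```python
import random


def mc_integrate(f, s, n, rng):
    """Average f over n pseudo-random points of the s-dimensional unit cube."""
    total = 0.0
    for _ in range(n):
        x = [rng.random() for _ in range(s)]
        total += f(x)
    return total / n


# The integral of x1 * x2 over the unit square is exactly 1/4.
rng = random.Random(0)
estimate = mc_integrate(lambda x: x[0] * x[1], s=2, n=20000, rng=rng)
```

A quasi-Monte Carlo variant keeps the same averaging loop and only replaces the pseudo-random points with the first N terms of a low-discrepancy sequence.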

1.2 Definitions of the Faure and Generalized Faure Sequences 

Tezuka [11-13] introduced a generalization of the Faure sequences, adding more 
degrees of freedom, which permit the construction of better distributed sequences. 
We use the following definition (Faure [4]): 

Definition 1. Let p be a fixed integer, greater than or equal to the dimension 
s. Let $A^{(i)}$, i = 1, ..., s, be given infinite non-singular lower triangular 
matrices, and let $P^{i-1}$, i = 1, ..., s, be the powers of the Pascal matrix, so that 

$$P^{i-1} = \left( \binom{j}{r} (i-1)^{j-r} \right)_{r \ge 0,\ j \ge 0} \pmod{p}.$$

The nth term of the s-dimensional generalized Faure sequence $\sigma$ is obtained by 
expanding n in the p-adic number system: 

$$n = \sum_{j=0}^{m} a_j\,p^j, \quad a_j \in \{0, \dots, p-1\},$$

computing 

$$(y_0^{(i)}, y_1^{(i)}, \dots)' = \operatorname{mod}\left( A^{(i)} P^{i-1} (a_0, \dots, a_m, 0, \dots)',\ p \right),$$

and then setting 

$$x_n^{(i)} = \sum_{j \ge 0} y_j^{(i)}\,p^{-j-1}, \quad i = 1, \dots, s,$$

where by mod(z, p) we denote the remainder of z modulo p. If the matrices $A^{(i)}$ 
are all equal to the identity matrix, we obtain the original Faure sequences. 
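For the original Faure sequence (all matrices equal to the identity) the definition can be evaluated directly, as in the following Python sketch. It is a straightforward, unoptimized illustration and not the generation algorithm of this paper. The function name `faure_point` is hypothetical, and p is assumed to be a prime not smaller than s.

```python
from math import comb


def faure_point(n, s, p):
    """n-th term of the original s-dimensional Faure sequence in base p."""
    # Digits a_0, a_1, ... of n in base p, least significant first.
    digits = []
    m = n
    while m > 0:
        digits.append(m % p)
        m //= p
    point = []
    for i in range(1, s + 1):
        x = 0.0
        for r in range(len(digits)):
            # Row r of the (i-1)-th power of the Pascal matrix applied to the digits.
            y = sum(comb(j, r) * pow(i - 1, j - r, p) * digits[j]
                    for j in range(r, len(digits))) % p
            x += y / p ** (r + 1)
        point.append(x)
    return point


first_terms = [faure_point(n, 2, 2) for n in range(1, 4)]
# first_terms == [[0.5, 0.5], [0.25, 0.75], [0.75, 0.25]]
```

The first coordinate is the van der Corput sequence in base p, since the zeroth power of the Pascal matrix is the identity. The cost of this direct evaluation grows with the square of the number of digits, which is what an incremental generator avoids.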



These sequences are of low discrepancy, and moreover, they are the first class of 
sequences for which an estimate of this kind was obtained (Faure [3]): 

$$D_N(\sigma) \le c_s\,N^{-1} \log^s N + O\left(N^{-1} \log^{s-1} N\right),$$

with a constant $c_s$ tending to zero as s tends to infinity. This result suggests 
that these sequences should be good for practical purposes, and in the area 
of financial mathematics they have become very popular for estimating some very 
high-dimensional integrals. They are used, for instance, in the so-called FinDer 
program [2], which is popular in financial mathematics. Various constructions 
for the matrices $A^{(i)}$ are known. We assume that they are given to us as input. 

1.3 Importance of CPU-Specific Generation Algorithms 

Because of their similarity from an algorithmic point of view, the quasi-Monte Carlo 
and classical Monte Carlo methods compete directly both in terms of precision 
and speed of computation. The main difference in this setting is that in the 
case of classical Monte Carlo we use a pseudo-random number generator, and in 
quasi-Monte Carlo methods we need a suitable generator function for the chosen 
low-discrepancy sequence. Obviously, for some applications the function / will 
be complicated and expensive to evaluate in terms of computational time, while 
for others the main part of the CPU time will be spent in the generation. 

In most quasi-Monte Carlo algorithms one can use almost any of the popular 
low-discrepancy sequences and compare the results, analogously to the situation 
in Monte Carlo, where one is advised to test different pseudo-random number 
generators. This suggests that it is preferable to have generation functions with 
similar calling conventions for every popular class of low-discrepancy sequences. 
However, we aim to develop the most efficient implementation possible for a 
given computer architecture. Since there is a certain degree of fine-grained par- 
allelism in the operations, required for generating the terms of the sequence, an 
important speed-up can be obtained by using the special instructions for processing 
multi-media data, which are available on most modern CPUs. In this paper 
we concentrate on the most popular such instruction sets. The MMX instruction 
set was the first extension to the 386 instruction set, introduced by Intel. It was 
later implemented on the newer AMD processors like Athlon and Duron, and 
therefore it can be considered as standard for the newer Pentium-compatible 
processors. The SSE1 and especially the SSE2 instruction set, introduced by 
Intel, are more powerful, but are available only on the newest processors. We 
show how our algorithm can be implemented using the MMX and the SSE2 instruction 
sets. Since the Pentium and Athlon processors are very widespread in 
both workstations and clusters, the speed-up obtained by using our algorithm 
will be available to a large part of the scientific community. Writing a version of 
the generation code for another instruction set is just a matter of implementing 
some simple operations on large vectors of numbers. 
2 Description of the New Generation Algorithm 



We developed our algorithm under the assumption, that the terms of the se- 
quence need to be generated consecutively. In such case it is justified to have a 
preprocessing phase, so that each term of the sequence could be easily generated 
using the preprocessed data and some data from the previous invocation of the 
generation function. The data is organized so as to make maximal use of the 
extended instructions available, which means the arrays described below should be 
at 8-byte aligned memory locations. At the preprocessing stage we perform the 
following operations: 



1. Input the dimensionality, s, of the sequence, the prime number p ≥ s, the 
coefficients of the matrices $A^{(i)}$, and the index of the first term to be 
generated, $N_0$. 

2. Compute the coefficients of the matrices $C^{(i)} = A^{(i)} P^{i-1}$. 

3. Compute the “skewed” coefficients, by the formula 




(d 

8 

^jk 






if j = 0; 
if J > 1- 



4. Compute the digits $b_j$, j = 0, ..., 64, of $N_0$ in the p-adic positional number 
system. 

5. Fill the table of relative positions: 

( 64 
k=0 

6. Compute the table of partial sums, defined as follows: 







n—k 



i = 0, . . . , 63. 



Note that we need at most the first 64 rows and columns of the matrices $A^{(i)}$ 
and $C^{(i)}$, since we work in double precision. The partial sums are, in fact, 
the coordinates of the $N_0$th term of the sequence. Observe that all these steps 
have complexity comparable to that of step 2. 

When the generation function is invoked for generating the Nth term of the 
sequence, we first compute the digits $b_j$ of N + 1 and copy the partial sums 
to the buffer, supplied by the user, since they represent the Nth term of the 
sequence. Then we proceed in two possible ways, depending on the index, k, of 
the least-significant non-zero p-adic digit of N + 1. If k = 0, we perform the 
following operations: 

_i (i) 

1. For i = l,...,s add dooP to the partial sum Sy , and if the result is 
greater than 1, subtract 1 from it; 

2. For i = l,...,s add dpg to every relative position t®, and if it becomes 
greater than p, subtract p from it. 
The key point in the algorithm is to perform these operations using the extended 
instruction sets available, without any conditional jumps, which would degrade 
the performance. This is explained in more detail in the next section. If k ≥ 1, 
we perform the following operations: 

1. For j = k to 0 with step —1, do the following: 

(a) For every i G 1, . . . , s, set to if if becomes greater than 

p, subtract p from it. 

(b) For every i G {1, . . . , s} store in 

Obviously, the probability that k will be greater than or equal to 1 is $p^{-1}$, and therefore 
the way we handle this second case is not as crucial for the performance of the 
algorithm as the case k = 0. We can vectorize the operations that are performed 
here in a similar manner to the previous case, but in this case the speed-up is 
not as dramatic. In the next section we give more details about the generation 
code that we wrote based on this algorithm. Note that we do obtain the sequence 
in its natural order, without any Gray code reordering. 

3 Implementation Details 

As can be seen from the description of the algorithm, the most crucial part in 
terms of CPU time is the way we handle the case when the index, N, of the term 
of the sequence, is not divisible by the prime p. That is why when we implement 
this algorithm on a given computer architecture, we need to write CPU-specific 
code only for this part, even though the case k ≥ 1 can be handled in a similar 
way. Since the operations that need to be vectorized are relatively simple, we 
successfully wrote variants of the generation code for the MMX and the SSE2 
instruction sets. We always worked in double precision, but it should not be a 
problem to write faster single precision versions. Note that the MMX instruction 
set is less powerful and does not allow the floating point operations in double 
precision to be vectorized. Yet we obtained good speed-up, as it can be seen in 
Table 1. 

Table 1. CPU time and speed measurements for generating ten million terms of the 
sequence in 100 dimensions on various types of computer architectures 

Manufacturer   CPU          Frequency (MHz)   Instruction set   CPU time   Speed 
Intel          Pentium IV   2200              none              12.9 s     84 
Intel          Pentium IV   2200              SSE2              3.9 s      254 
AMD            Athlon XP    1662              none              13.5 s     74.3 
AMD            Athlon XP    1662              MMX               8.2 s      121 


The vectorization is straightforward, basically equivalent to manual loop
unrolling. The code is available under the GNU Public License at:
http://parallel.bas.bg/~emanouil/sequences.html






Emanouil I. Atanassov 



Here we only give the main idea. We wish to perform efficiently the operations
described above without using conditional jumps, because they significantly
degrade the performance. We use the fact that if we need to compute
the expression

$$a(x, y) = \begin{cases} x & \text{if } p(x) = 0, \\ x + y & \text{if } p(x) = 1, \end{cases}$$

we can use the formula

$$a(x, y) = x + \bigl(q(p(x)) \ \mathrm{and}\ y\bigr),$$

where the plus sign denotes integer or floating point summation, "and" is the
bitwise conjunction, and by q(p(x)) we denote a sequence of bits, all equal to
the value of the predicate p(x), with width in bytes equal to the width of x
and y in bytes. Since our predicates are of comparison type, they are
vectorized naturally and efficiently.
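The masking idea can also be tried out in portable C, without any vector instructions. The sketch below is only an illustration and is not taken from the generation code described here: the names `masked_add` and `add_mod_p`, the byte-wide digits, and the "subtract p when the sum reaches p" convention are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Branch-free a(x, y): returns x + y when pred is nonzero and x otherwise.
   The mask is 0xFF for a true predicate and 0x00 for a false one. */
static uint8_t masked_add(uint8_t x, uint8_t y, int pred)
{
    uint8_t mask = (uint8_t)(-(pred != 0));
    return (uint8_t)(x + (mask & y));
}

/* Hypothetical digit update: digits[i] = (digits[i] + incr[i]) mod p,
   assuming digits[i] < p, incr[i] < p and 2 * p <= 256. */
static void add_mod_p(uint8_t *digits, const uint8_t *incr, size_t n, uint8_t p)
{
    for (size_t i = 0; i < n; ++i) {
        uint8_t sum = (uint8_t)(digits[i] + incr[i]);
        /* Adding (mask & -p) modulo 256 subtracts p exactly when sum >= p. */
        digits[i] = masked_add(sum, (uint8_t)(0u - p), sum >= p);
    }
}
```

The loop body contains no data-dependent jump, which is the property that lets a compiler or a hand-written MMX/SSE2 routine process several digits per instruction.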
As an example, we show how step 1 of the algorithm in the case k = 0 is
vectorized using the MMX instruction set (only the innermost loop is shown):



    for (i = 0; i < dim_rounded_up_to_factor_of_16; i += 16) {
        asm volatile ("movq %0,%%mm1" : : "m" (result[0]));  /* first 8 bytes of result */
        asm volatile ("movq %0,%%mm0" : : "m" (input[0]));   /* first 8 bytes of input */
        asm volatile ("paddb %mm0,%mm1");                     /* byte-wise sum in mm1 */
        asm volatile ("movq %0,%%mm2" : : "m" (result[8]));  /* second 8 bytes of result */
        asm volatile ("movq %0,%%mm3" : : "m" (input[8]));   /* second 8 bytes of input */
        asm volatile ("paddb %mm3,%mm2");                     /* byte-wise sum in mm2 */
        asm volatile ("movq %mm1,%mm5");                      /* keep a copy of the sums */
        asm volatile ("pcmpgtb %mm7,%mm1");                   /* 0xFF in bytes where sum > mm7 */
        asm volatile ("movq %mm2,%mm6");
        asm volatile ("pcmpgtb %mm7,%mm2");
        asm volatile ("pand %mm7,%mm1");                      /* mm7 in those bytes, 0 elsewhere */
        asm volatile ("pand %mm7,%mm2");
        asm volatile ("psubb %mm1,%mm5");                     /* conditional subtraction */
        asm volatile ("movq %%mm5,%0" : "=m" (result[0]));
        asm volatile ("psubb %mm2,%mm6");
        asm volatile ("movq %%mm6,%0" : "=m" (result[8]));
        result += 16;
        input += 16;
    }



4 Timing Results 

We present timing results obtained by compiling our code with the gcc compiler.
The non-standard instructions are coded using inline assembly statements. In
our tests we measured the time required for producing ten million terms of the
sequence in various dimensions. The generation speed is measured in millions of
coordinates generated per second. We observed that the generation speed
falls somewhere between that obtained with our generators for the Sobol' and
Halton sequences, which are also available at the above web address. Note that
an average pseudo-random number generator runs at a speed of around 10 million
numbers per second, and therefore the use of low-discrepancy sequences offers
a significant performance advantage. We also show the generation time and speed
for the portable version of our algorithm, which is not restricted to the x86
architecture, so that the significant speed-up obtained through vectorization
can be observed.

5 Conclusions and Directions for Future Work 

With the new vectorized algorithm we are able to achieve significant speed-up 
with respect to regular C or FORTRAN code implementation. At the same time 
porting the algorithm to another architecture would require only trivial changes. 
Our plans are to implement generation codes for some other popular classes of 
low-discrepancy sequences under the same paradigm. 
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Abstract. Bachvalov proved that the optimal order of convergence of a 
Monte Carlo method for numerical integration of functions with bounded
k-th order derivatives is $O(N^{-1/2-k/s})$, where s is the dimension. We
construct a new Monte Carlo algorithm with such rate of convergence, which
adapts to the variations of the sub-integral function and gains substantially
in accuracy when a low-discrepancy sequence is used instead of
pseudo-random numbers.

Theoretical estimates of the worst-case error of the method are obtained. 
Experimental results, showing the excellent parallelization properties of 
the algorithm and its applicability to problems of moderately high di- 
mension, are also presented. 



1 Introduction 

The paper proposes a new quasi-Monte Carlo algorithm for integrating smooth
functions. We consider integration over the unit cube $E^s = [0,1)^s$. The class of
functions under consideration is the following:

Definition 1. For given integers s and k, s > 1 and k > 0, the class $W^k(M, E^s)$
consists of all real functions f defined in $E^s$, such that all the derivatives

$$\frac{\partial^r f}{\partial x_1^{i_1} \cdots \partial x_s^{i_s}}$$

exist for all $i_1 + \cdots + i_s = r < k$ and their absolute values are bounded by M.

The results of Bachvalov [2,3] establish lower bounds on the integration error
of both deterministic and stochastic (Monte Carlo) methods. Various Monte
Carlo methods for approximate integration of such functions with the optimal
order of convergence $O(N^{-1/2-k/s})$ are known (see, e.g., [1,5,10,20]). In this
paper we investigate, from a theoretical and practical viewpoint, a quasi-Monte
Carlo algorithm based loosely on the same ideas. The idea of quasi-Monte Carlo
algorithms is to replace the pseudo-random numbers used in Monte Carlo with a
deterministic sequence of points, uniformly distributed in some high-dimensional
unit cube $E^s$. The quality of distribution of these sequences is measured through
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which is believed to be the optimal, are called low-discrepancy sequences. Note 
that in Monte Carlo methods one can frequently obtain a statistical error
estimate along with the result, which is necessary in many areas. In quasi-Monte
Carlo methods a similar error estimate of statistical nature can be obtained using
the so-called scrambling of sequences, which is a way of adding some randomness
to otherwise entirely deterministic sequences. The most popular quasi-Monte
Carlo integration method is based on the simple formula:

$$\int_{E^s} f(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i),$$

where $\sigma = \{x_i\}_{i=1}^{\infty}$ is a uniformly distributed sequence. For a good
introduction to the theory of uniform distribution modulo 1 see, e.g., the books
[9] and [7]. More complex methods are developed and tested by many authors
(see, e.g., [18,16]). A method based on scrambled nets can be seen in [12]. In
quasi-Monte Carlo methods it is important to achieve low constructive
dimensionality (for a definition, see [17], p. 255). Since the first few
coordinates of a low-discrepancy sequence are better distributed, various ways
to adjust the integration procedure to the properties of the integrand are
proposed (see, e.g., the Brownian bridge construction in [4]).
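As a minimal illustration of the simple quasi-Monte Carlo formula, the following C sketch averages an integrand over the first N points of a two-dimensional Halton sequence. The choice of the Halton sequence with bases 2 and 3, the function names and the test integrand are assumptions made for this example only; they are not the generators or test functions used in the paper.

```c
/* Radical inverse of n in the given base (van der Corput sequence). */
static double radical_inverse(unsigned long n, unsigned base)
{
    double inv = 1.0 / base, scale = inv, value = 0.0;
    while (n > 0) {
        value += (double)(n % base) * scale;
        n /= base;
        scale *= inv;
    }
    return value;
}

/* Quasi-Monte Carlo estimate (1/N) * sum f(x_i) over the first N points
   of the Halton sequence in [0,1)^2 with bases 2 and 3. */
static double qmc_integrate_2d(double (*f)(double, double), unsigned long n_points)
{
    double sum = 0.0;
    for (unsigned long i = 1; i <= n_points; ++i)
        sum += f(radical_inverse(i, 2), radical_inverse(i, 3));
    return sum / (double)n_points;
}

/* Example integrand; its exact integral over the unit square is 1/4. */
static double product_xy(double x, double y)
{
    return x * y;
}
```

Calling `qmc_integrate_2d(product_xy, 4096)` should return a value close to 0.25; the same routine works for any integrand with the signature `double f(double, double)`.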

The performance of the quasi-Monte Carlo method depends on the quality 
of the distribution of the underlying low-discrepancy sequence. Various families 
of such sequences are known, and we tested some of the most popular ones in 
our algorithm. The algorithm is described in Sec. 2. Numerical results, showing 
the applicability of the method and its excellent parallelization properties, are 
given in Sec. 3. 

2 Description of the Algorithm 

Since we are going to describe our algorithm using the term pseudo-inverse
matrix, we provide the following

Definition 2. (see e.g. [11], p. 257) Let A be an m × n real matrix with rank
r. Suppose that $U^T A V = \Sigma$ is the SVD of A. Then the pseudo-inverse matrix
of A is defined as $A^+ = V \Sigma^+ U^T$, where

$$\Sigma^+ = \mathrm{diag}\left(\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_r}, 0, \ldots, 0\right) \in \mathbb{R}^{n \times m}$$

and $\sigma_1, \ldots, \sigma_r$ are the nonzero singular values of A. If rank(A) = n, then
$A^+ = (A^T A)^{-1} A^T$.

Definition 3. Let t be an integer, t ≥ 1, and let $a_1, \ldots, a_t$ be fixed points in $E^s$.
Let $f \in W^k(M, E^s)$ for some M. The s-tuples $(i_1, \ldots, i_s)$ of non-negative
integers with $i_1 + \cdots + i_s = r < k$ can be ordered lexicographically and there
are exactly $\binom{s+k-1}{s}$ of them. Consider the matrix B with t rows and
$\binom{s+k-1}{s}$ columns, such that the column of B corresponding to the tuple
$(i_1, \ldots, i_s)$ contains all the products

$$b_j(i_1, \ldots, i_s) = \prod_{l=1}^{s} \left(a_j^{(l)}\right)^{i_l}, \quad j = 1, \ldots, t,$$

where $a_j^{(l)}$ denotes the l-th coordinate of the point $a_j$. Suppose that
$t \ge \binom{s+k-1}{s}$ and that B has rank $\binom{s+k-1}{s}$. Let the matrix C be the
pseudo-inverse of B. C has $\binom{s+k-1}{s}$ rows and t columns, with entries
$c_j(i_1, \ldots, i_s)$. Now the interpolation polynomial L(f, x) is defined by

$$L(f, x) = \sum_{j=1}^{t} f(a_j) \sum_{(i_1, \ldots, i_s),\ i_1 + \cdots + i_s < k} c_j(i_1, \ldots, i_s)\, x_1^{i_1} \cdots x_s^{i_s}.$$

Observe that if the points $a_1, \ldots, a_t$ are in general position, the matrix B has
rank $\binom{s+k-1}{s}$. Since in our definition B always has full rank, the
pseudo-inverse C is equal to $(B^T B)^{-1} B^T$. The reason for using formulae with
$t > \binom{s+k-1}{s}$ for numerical integration is that the coefficients of the
matrix C are much smaller in this case. Of all the matrices A satisfying BAB = B,
the pseudo-inverse is chosen because its Frobenius norm $\|\cdot\|_F$ is the smallest.

Now let us fix some integers k and s, greater than 1, and consider functions
$f \in W^k(M, E^s)$ defined over the unit cube $E^s$. We also fix the number t and the
points $\{a_1, \ldots, a_t\} \subset E^s$, such that an interpolation formula L(f)(x) = L(f, x)
as described in Definition 3 is defined. Observe that by integrating over $E^s$ we
get a quadrature formula

$$\int_{E^s} f(x)\,dx \approx \int_{E^s} L(f, x)\,dx = \sum_{j=1}^{t} f(a_j) \sum_{i_1 + \cdots + i_s < k} \frac{c_j(i_1, \ldots, i_s)}{(i_1 + 1) \cdots (i_s + 1)}.$$



Definition 4. Consider a rectangular area $K \subset E^s$ and an interpolation formula
$L : f \to L(f)$ as in Definition 3. Let T be a linear transformation such that
$T(K) = E^s$. For a given function $f : K \to \mathbb{R}$ consider the function
$g : E^s \to \mathbb{R}$ such that $g(Tx) = f(x)$ for $x \in K$. The interpolation formula
$L_K : f \to L_K(f)$ is defined by $L_K(f, x) = L(g, Tx)$.



Now we define a Monte Carlo integration formula, which we are going to 
investigate in the sequel. 

Definition 5. Let N ≥ 1 be an integer. Consider a representation of the unit
cube $E^s$ as a union of N rectangles $K_1, \ldots, K_N$.
For the rectangle $K_i = \prod_{j=1}^{s} [b_j, c_j)$ we consider the simplest linear
transformation T such that $T(K_i) = E^s$:

$$T(x_1, \ldots, x_s) = \left(\frac{x_1 - b_1}{c_1 - b_1}, \ldots, \frac{x_s - b_s}{c_s - b_s}\right).$$

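By $T^{-1}$ the points $\{a_1, \ldots, a_t\}$ are transformed into points
$\{a_1^i, \ldots, a_t^i\} \subset K_i$. Let m be a given integer, m ≥ 2, and let
$\xi_1, \ldots, \xi_N$ be N independent random variables, uniformly distributed in the
respective rectangles $K_1, \ldots, K_N$. For every $K_i$ we use m samples
$\xi_i^1, \ldots, \xi_i^m$ of $\xi_i$ and define the Monte Carlo estimate of
$\int_{K_i} f(x)\,dx$ to be

$$\mathrm{vol}(K_i) \left[ \sum_{j=1}^{t} A_j f(a_j^i) + \frac{1}{m} \sum_{l=1}^{m} \left( f(\xi_i^l) - L_{K_i}(f, \xi_i^l) \right) \right], \qquad A_j = \sum_{i_1 + \cdots + i_s < k} \frac{c_j(i_1, \ldots, i_s)}{(i_1 + 1) \cdots (i_s + 1)}.$$

The Monte Carlo estimate for the integral $\int_{E^s} f(x)\,dx$ is obtained by summing
all the estimates for the integrals $\int_{K_i} f(x)\,dx$.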

Such integration formulae were considered in [1], but only when t has the
smallest possible value $\binom{s+k-1}{s}$. This additional degree of freedom results
in faster convergence of the resulting formulae.
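The structure of such an estimate, an exactly integrated interpolation polynomial plus a sampled correction, can be seen in a one-dimensional C sketch. Everything specific here is an assumption chosen for brevity and is not the configuration used in the paper: s = 1, linear interpolation through t = 2 nodes at the ends of each subinterval (so the pseudo-inverse reduces to an ordinary inverse and the quadrature weights are 1/2 and 1/2), a small linear congruential generator, and the names `next_uniform`, `stratified_cv_integrate` and `cubic`.

```c
#include <stdint.h>

/* Small 64-bit linear congruential generator returning doubles in [0, 1). */
static double next_uniform(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0; /* divide by 2^53 */
}

/* Estimate the integral of f over [0,1) using n_cells equal subintervals
   K_i = [b, c) and m uniform samples in each of them. */
static double stratified_cv_integrate(double (*f)(double), int n_cells, int m,
                                      uint64_t seed)
{
    uint64_t state = seed;
    double total = 0.0;
    for (int i = 0; i < n_cells; ++i) {
        double b = (double)i / n_cells, c = (double)(i + 1) / n_cells;
        double fb = f(b), fc = f(c);
        double residual = 0.0;
        for (int l = 0; l < m; ++l) {
            double y = next_uniform(&state);          /* y = T(xi) in [0,1) */
            double xi = b + (c - b) * y;              /* sample point in K_i */
            double interp = fb * (1.0 - y) + fc * y;  /* L_K(f, xi) */
            residual += f(xi) - interp;
        }
        /* vol(K_i) * (integral of the interpolant + mean sampled residual) */
        total += (c - b) * (0.5 * (fb + fc) + residual / m);
    }
    return total;
}

/* Example integrand with exact integral 1/4 over [0,1). */
static double cubic(double x)
{
    return x * x * x;
}
```

Because only the difference between f and its interpolant is sampled, the random part of the error shrinks together with the interpolation error as the subintervals get smaller.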



3 Numerical Results 

In this section we present numerical results that are obtained for calculating the
value of the integral

$$I_k = \int_{E^s} F_k(x)\,dx, \quad k = 1, \ldots, 7.$$

The functions considered here have different peculiarities. They are often
used for benchmarking Monte Carlo and quasi-Monte Carlo algorithms.
They are given as follows:




^2 = EH (-1)^4’ 

i=l j=l 



Fs = exp(-x^)cos(|a:|). 
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F4 = exp{-x‘^)y/{l + x'^), 
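$$F_5 = \cos\left(2\pi u_1 + \sum_{i=1}^{s} a_i x_i\right),$$

$$F_6 = \left(1 + \sum_{i=1}^{s} a_i x_i\right)^{-(s+1)},$$

$$F_7 = \exp\left(-\sum_{i=1}^{s} a_i^2 (x_i - u_i)^2\right).$$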

The first function $F_1$ has been used in [8] and [14] as a test in the context of
rating point sets for quasi-Monte Carlo and Monte Carlo integration. The last
three functions are a part of the test functions package proposed by Genz or in
[13,15,19]. The parameters $u_i$, $a_i$ are divided into unaffective and affective
parameters. The vector a of affective parameters is scaled so that it satisfies the
requirement

$$\|a\|_1 = 110\, s^{-5/2},$$

$$\|a\|_1 = 600\, s^{-2},$$

$$\|a\|_1 = 100\, s^{-1}$$

for the functions $F_5$, $F_6$, $F_7$ respectively, which depends only on the space
dimensionality.
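To make the scaling of the affective parameters concrete, the C sketch below rescales a vector a to a prescribed 1-norm and evaluates the corner-peak function (1 + Σ a_i x_i)^{-(s+1)}. The function names `scale_l1` and `genz_corner_peak` are illustrative assumptions; only the formula and the target norm 600 s^{-2} come from the text.

```c
#include <stddef.h>

/* Rescale a[0..s-1] in place so that its 1-norm equals target.
   At least one entry of a is assumed to be nonzero. */
static void scale_l1(double *a, size_t s, double target)
{
    double norm = 0.0;
    for (size_t i = 0; i < s; ++i)
        norm += (a[i] < 0.0) ? -a[i] : a[i];
    for (size_t i = 0; i < s; ++i)
        a[i] *= target / norm;
}

/* Corner-peak test function (1 + sum a_i x_i)^(-(s+1)). */
static double genz_corner_peak(const double *x, const double *a, size_t s)
{
    double base = 1.0;
    for (size_t i = 0; i < s; ++i)
        base += a[i] * x[i];
    double value = 1.0;
    for (size_t i = 0; i <= s; ++i) /* s + 1 divisions by base */
        value /= base;
    return value;
}
```

For dimension s the call `scale_l1(a, s, 600.0 / ((double)s * (double)s))` applies the norm requirement stated above for this function.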

In Table 1 we present numerical results from the approximate integration
of the integral $I_1$ using the Monte Carlo and quasi-Monte Carlo versions of the
algorithm. The algorithms were tested for dimensions 4, 5, 6, and 9. We did not
compare our algorithm with crude Monte Carlo as in [6], because of its much
slower convergence. From this table we conclude that, for smooth functions, the
quasi-Monte Carlo variant of the algorithm shows approximately the same accuracy
as the Monte Carlo algorithm.




Table 1. Results for I_1, with dimension s, number of steps N and number of points
per cube 40.

          s = 4               s = 5               s = 6               s = 9
N   r     MC       S.         MC       S.         MC       S.         MC       S.
3   4     2.69E-5  1.52E-5    2.65E-5  1.71E-5    2.43E-5  2.53E-5    1.34E-5  9.46E-6
    6     4.59E-7  2.61E-7    1.05E-6  7.07E-7    2.62E-7  2.31E-7    1.57E-6  4.83E-7
4   4     5.59E-6  2.84E-6    2.68E-7  2.78E-6    2.33E-6  2.65E-6    1.36E-6  1.74E-6
    6     7.06E-8  3.07E-8    4.99E-8  7.15E-8    5.84E-8  4.11E-8    -        -
5   4     1.43E-6  1.15E-6    7.51E-7  5.41E-7    5.15E-7  5.67E-7    -        -
    6     5.30E-9  3.57E-9    1.47E-8  1.08E-8    8.19E-9  8.39E-9    -        -

Table 2. Numerical results for I_2, I_3, I_4 with dimension s = 5.

          I_2                   I_3                   I_4
N   r     MC        S.          MC        S.          MC        S.
3   4     1.07E-6   1.86E-6     1.29E-6   2.50E-6     7.93E-7   1.74E-6
    6     4.07E-16  8.15E-16    2.52E-8   3.37E-8     1.89E-8   2.45E-8
4   4     1.86E-7   3.05E-7     4.28E-7   1.86E-7     2.91E-7   1.19E-7
    6     4.12E-16  4.28E-16    2.55E-9   3.52E-9     1.91E-9   2.51E-9
5   4     3.54E-8   5.70E-8     9.47E-8   6.62E-8     6.92E-8   4.85E-8
    6     5.19E-16  5.19E-16    4.96E-10  7.74E-10    3.72E-10  5.75E-10

Table 3. Numerical results for I_5, I_6, I_7 with dimension s = 5.

          I_5                 I_6                 I_7
N   r     MC       S.         MC       S.         MC       S.
3   4     6.44E-3  5.89E-3    3.06E-8  4.61E-8    1.93E-5  4.37E-5
    6     9.65E-3  8.12E-3    1.02E-7  3.56E-7    6.24E-5  7.02E-5
4   4     5.05E-3  5.45E-3    3.01E-8  3.05E-8    5.33E-6  8.57E-6
    6     4.51E-3  2.74E-3    6.04E-8  1.01E-7    1.76E-5  1.41E-5
5   4     1.84E-3  3.89E-3    7.68E-9  1.09E-8    5.12E-6  4.25E-6
    6     3.09E-3  2.06E-3    2.80E-8  1.51E-8    4.26E-6  3.96E-6

Table 4. CPU time and efficiency for Fa, dimension s = 5, smoothness 4, number of 
steps 20, number of points per cube 40. 



NP       1            2            4            8            16           32
Method   Time  Eff.   Time  Eff.   Time  Eff.   Time  Eff.   Time  Eff.   Time  Eff.
MC       336   1      169   0.99   83    1.01   43    0.98   24    0.88   14    0.75
Sobol    337   1      170   0.99   86    0.98   45    0.92   24    0.88   15    0.71

In Tables 2 and 3 we show numerical results for the other integrals in 5
dimensions. We again observe roughly the same probable error of the Monte Carlo
and the quasi-Monte Carlo algorithms. All these results are obtained with the
parameter m (the number of quasi-random points in every small cube) equal to
40. In Table 4 we show the parallel efficiency of the algorithm. The computations
for every small cube can be done in parallel, and thus the algorithm possesses a
significant amount of natural parallelism. The results are achieved on a cluster
of Intel Xeon 2.2 processors, using MPI.
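The efficiency column of Table 4 is consistent with the usual definition Eff = T_1 / (NP · T_NP), where T_NP is the time on NP processors. The paper does not state the formula, so this definition is an assumption, written out here as a small C helper with the hypothetical name `parallel_efficiency`.

```c
/* Parallel efficiency: time on one processor divided by
   (number of processors * time on that many processors). */
static double parallel_efficiency(double t_one, double t_np, int np)
{
    return t_one / ((double)np * t_np);
}
```

For the Monte Carlo row, `parallel_efficiency(336, 14, 32)` gives 0.75 and `parallel_efficiency(336, 169, 2)` gives about 0.99, matching the table. Since the tabulated times are rounded, other entries may differ in the last digit.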

4 Conclusions 

We have developed a quasi-Monte Carlo method which achieves order of
convergence $O(N^{-1/2-k/s})$ on smooth functions. The proposed quasi-Monte Carlo
method shows roughly the same accuracy as the Monte Carlo version, and also
exhibits good parallel efficiency. However, the statistical a posteriori error
estimation, based on the use of scrambled low-discrepancy sequences, is slightly
more complicated and less reliable than in Monte Carlo methods. In our
experiments we obtained good results by using the Sobol sequences. Note that the
constructive dimensionality of the algorithm is relatively high, namely sm, and
thus the integration method can be used for assessing the quality of distribution
of other families of low-discrepancy sequences, i.e., for benchmarking purposes.
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Abstract. This paper describes Monte Carlo (MC) method for Multi- 
ple Knapsack Problem (MKP). The MKP can be defined as economical 
problem like resource allocation and capital budgeting problems. Ant Colony
Optimization (ACO) is an MC method created to solve Combinatorial
Optimization Problems (COPs). The paper proposes a Local Search (LS)
procedure which can be coupled with the ACO algorithm to improve the
efficiency of solving the MKP. This will provide optimal or near-optimal
solutions for large problems with an acceptable amount of computational
effort. Computational results are presented to assess the performance of
the proposed technique.



1 Introduction 

There are many NP-hard combinatorial optimization problems for which it is
impractical to find an optimal solution. Among them is the MKP. For such problems
the reasonable way is to look for algorithms that quickly produce near-optimal
solutions. ACO [1,3,2] is an MC method created to solve COPs. It is a
meta-heuristic procedure for quickly and efficiently obtaining high quality
solutions to complex optimization problems [7]. ACO algorithms can be interpreted
as parallel replicated Monte Carlo systems [10]. MC systems [8] are general
stochastic simulation systems, that is, techniques performing repeated sampling
experiments on the model of the system under consideration by making use of a
stochastic component in the state sampling and/or transition rules. Experimental
results are used to update some statistical knowledge about the problem, as well
as the estimate of the variables the researcher is interested in. In turn, this
knowledge can also be used iteratively to reduce the variance in the estimation
of the described variables, directing the simulation process toward the most
interesting state space regions. Analogously, in ACO algorithms the ants sample
the problem’s solution space by repeatedly applying a stochastic decision policy
until a feasible solution of the considered problem is built. The sampling is
realized concurrently by a collection of differently instantiated replicas of the
same ant type. Each ant “experiment” allows the local statistical knowledge on
the problem structure to be modified adaptively. The recursive retransmission of
such knowledge by means of stigmergy determines a reduction in the variance of
the whole search process: the most interesting transitions explored so far
probabilistically bias future search, preventing ants from wasting resources in
unpromising regions of the search.

The ACO algorithms were inspired by the observation of real ant colonies.
Ants are social insects, that is, insects that live in colonies and whose
behavior is directed more to the survival of the colony as a whole than to that
of a single individual component of the colony. An important and interesting
behavior of ant colonies is how ants can find the shortest path between food
sources and their nest. ACO is a recently developed, population-based approach
which has been successfully applied to several NP-hard COPs [4]. One of its main
ideas is the indirect communication among the individuals of a colony of agents,
called “artificial” ants, based on an analogy with trails of a chemical
substance, called pheromone, which real ants use for communication. The
“artificial” pheromone trails are a kind of distributed numerical information
which is modified by the ants to reflect their experience accumulated while
solving a particular problem. When constructing a solution, at each step ants
compute a set of feasible moves and select the best according to some
probabilistic rules. The transition probability is based on the heuristic
information and pheromone trail level of the move (usability of the move in the
past).

The rest of the paper is organized as follows: Section 2 describes the general
framework for MKP as a COP. The search strategy of the local search method
is explained in Section 3. Section 4 outlines the implemented ACO applied to
MKP. In Section 5 experimental results over test problems are shown. Finally,
some conclusions are drawn.



2 Formulation of the Problem 

The MKP has received wide attention from the operations research community,
because it embraces many practical problems. Applications include resource
allocation in distributed systems, capital budgeting and cutting stock problems.
In addition, MKP can be seen as a general model for any kind of binary problems
with positive coefficients [6,5]. The MKP can be thought of as a resource
allocation problem, where we have m resources (the knapsacks) and n objects. The
object j has a profit $p_j$. Each resource i has its own budget $c_i$ (knapsack
capacity) and consumption $r_{ij}$ of resource i by object j. We are interested in
maximizing the profit, while working with a limited budget.

The MKP can be formulated as follows:

$$\max \sum_{j=1}^{n} p_j x_j$$

$$\text{subject to } \sum_{j=1}^{n} r_{ij} x_j \le c_i, \quad i = 1, \ldots, m, \qquad (1)$$

$$x_j \in \{0, 1\}, \quad j = 1, \ldots, n.$$

There are m constraints in this problem, so MKP is also called the m-dimensional
knapsack problem. Let $I = \{1, \ldots, m\}$ and $J = \{1, \ldots, n\}$, with $c_i \ge 0$ for all
$i \in I$. A well-stated MKP assumes that $p_j > 0$ and $r_{ij} \le c_i \le \sum_{j=1}^{n} r_{ij}$ for
all $i \in I$ and $j \in J$. Note that the $[r_{ij}]_{m \times n}$ matrix and $[c_i]_m$ vector are both
non-negative.
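The formulation translates directly into a feasibility and profit check. The C sketch below is illustrative only; the function name `mkp_profit`, the row-major storage of the consumption matrix and the use of -1 as an infeasibility marker are assumptions, not part of the paper.

```c
#include <stddef.h>

/* Returns the total profit of the 0/1 selection x, or -1 if some capacity
   c[i] is exceeded. The consumption matrix is stored row-major:
   r[i * n + j] is the consumption of resource i by object j. */
static long mkp_profit(size_t m, size_t n, const long *p, const long *r,
                       const long *c, const unsigned char *x)
{
    for (size_t i = 0; i < m; ++i) {
        long used = 0;
        for (size_t j = 0; j < n; ++j)
            if (x[j])
                used += r[i * n + j];
        if (used > c[i])
            return -1; /* constraint i of (1) is violated */
    }
    long profit = 0;
    for (size_t j = 0; j < n; ++j)
        if (x[j])
            profit += p[j];
    return profit;
}
```

For example, with profits (10, 7, 4), consumption rows (3, 2, 1) and (2, 4, 1), and capacities (5, 5), the selection (1, 0, 1) is feasible with profit 14, while (1, 1, 0) violates the second constraint.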
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In the MKP we are not interested in solutions giving a particular order. 
Therefore a partial solution is represented by $S = \{i_1, i_2, \ldots, i_j\}$, and the
most recent element incorporated into S, $i_j$, need not be involved in the process
of selecting the next element. Moreover, solutions for ordering problems have
a fixed length, as we search for a permutation of a known number of elements.
Solutions for MKP, however, do not have a fixed length. We define the graph of
the problem as follows: the nodes correspond to the items, and the arcs fully
connect the nodes.

3 The Local Search Strategy 

The LS method (move-by-move method) [9] perturbs a given solution to generate
different neighborhoods using a move generation mechanism. In general,
neighborhoods for large-sized problems can be very complicated to search.
Therefore, LS attempts to improve an initial solution by a series of local
improving changes. A move-generation is a transition from a solution S to another
one $S' \in V(S)$ in one step, where V(S) denotes the neighborhood of S. These
solutions are selected and accepted according to some pre-defined criteria. The
returned solution S' may not be optimal, but it is the best solution in its local
neighborhood V(S). A local optimal solution is a solution with the local maximal
possible cost value. Knowledge of a solution space is the essential key to more
efficient search strategies. These strategies are designed to use this prior
knowledge and to overcome the complexity of an exhaustive search by organizing
searches across the alternatives offered by a particular representation of the
solution. The main purpose of the LS implementation is to speed up and improve
the solutions constructed by the ACO.

A practical LS strategy that satisfies the MKP requirements has been developed
and coupled with the ACO algorithm. The MKP solution can be represented
by a string with 0 for not chosen items and 1 for chosen items. In our LS
procedure we replace one of the 1s with 0 and one of the 0s with 1. The change in
cost is computed and the swap is accepted or rejected according to the acceptance
strategy.
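A swap move of this kind can be sketched in C as follows. The acceptance strategy is not specified in the text, so this sketch assumes a first-improvement rule (accept the first feasible swap that increases the profit); the function names and the row-major layout of the consumption matrix are also illustrative assumptions.

```c
#include <stddef.h>

/* Check all m capacity constraints for the 0/1 selection x.
   r[i * n + j] is the consumption of resource i by object j. */
static int mkp_feasible(size_t m, size_t n, const long *r, const long *c,
                        const unsigned char *x)
{
    for (size_t i = 0; i < m; ++i) {
        long used = 0;
        for (size_t j = 0; j < n; ++j)
            if (x[j])
                used += r[i * n + j];
        if (used > c[i])
            return 0;
    }
    return 1;
}

/* Repeatedly apply the first feasible swap (one chosen item out, one
   unchosen item in) that increases the profit. Returns the number of
   accepted swaps; x holds a local optimum of the swap neighborhood. */
static int swap_local_search(size_t m, size_t n, const long *p, const long *r,
                             const long *c, unsigned char *x)
{
    int accepted = 0, improved = 1;
    while (improved) {
        improved = 0;
        for (size_t out = 0; out < n && !improved; ++out) {
            if (!x[out])
                continue;
            for (size_t in = 0; in < n && !improved; ++in) {
                if (x[in] || p[in] <= p[out])
                    continue;              /* not a profit-increasing swap */
                x[out] = 0;                /* tentative swap */
                x[in] = 1;
                if (mkp_feasible(m, n, r, c, x)) {
                    improved = 1;          /* keep the swap */
                    ++accepted;
                } else {
                    x[out] = 1;            /* undo the swap */
                    x[in] = 0;
                }
            }
        }
    }
    return accepted;
}
```

Every accepted swap strictly increases the profit, so the loop terminates. Note that a swap keeps the number of chosen items fixed, so this neighborhood alone cannot add an item to a solution that still has spare capacity.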

4 ACO Algorithm for MKP

Real ants foraging for food lay down quantities of pheromone (chemical cues)
marking the path that they follow. An isolated ant moves essentially at random,
but an ant encountering a previously laid pheromone trail will detect it and
decide to follow it with high probability, thereby reinforcing it with a further
quantity of pheromone. The repetition of the above mechanism represents the
auto-catalytic behavior of a real ant colony, where the more the ants follow a
trail, the more attractive that trail becomes.

The ACO algorithm uses a colony of artificial ants that behave as co-operative
agents in a mathematical space where they are allowed to search and reinforce
pathways (solutions) in order to find the optimal ones. After initialization of
the pheromone trails, ants construct feasible solutions, starting from random
nodes, then the pheromone trails are updated. At each step ants compute a
set of feasible moves and select the best one (according to some probabilistic
rules) to carry out the rest of the tour. The transition probability is based on
the heuristic information and pheromone trail level of the move. The higher
the value of the pheromone and the heuristic information, the more profitable it
is to select this move and resume the search. In the beginning, the initial
pheromone level is set to a small positive constant value $\tau_0$, and then ants
update this value after completing the construction stage. ACO algorithms adopt
different criteria to update the pheromone level. In our implementation we use
the Ant Colony System (ACS) [3] approach.

In ACS the pheromone updating stage consists of: 

4.1 Local Update Stage 

While ants build their solutions, they locally update the pheromone level of the
visited paths by applying the local update rule as follows:

$$\tau_{ij} \leftarrow (1 - \rho)\tau_{ij} + \rho\tau_0, \qquad (2)$$

where $\tau_{ij}$ is the amount of pheromone on the arc (i, j) of the graph of the
problem, $\rho$ is the persistence of the trail, and the term $(1 - \rho)$ can be
interpreted as trail evaporation.

The aim of the local updating rule is to make better use of the pheromone
information by dynamically changing the desirability of edges. Using this rule,
ants will search in a wide neighborhood of the best previous solution. As shown
in the formula, the pheromone level on the paths is highly related to the value
of the evaporation parameter $\rho$. The pheromone level will be reduced, and this
will reduce the chance that the other ants will select the same solution;
consequently the search will be more diversified.

4.2 Global Updating Stage

When all ants have completed their solutions, the pheromone level is updated
by applying the global updating rule only on the paths that belong to the best
solution since the beginning of the trial, as follows:

$$\tau_{ij} \leftarrow (1 - \rho)\tau_{ij} + \Delta\tau_{ij}, \qquad (3)$$

$$\text{where } \Delta\tau_{ij} = \begin{cases} L_{gb} & \text{if } (i, j) \in \text{best solution}, \\ 0 & \text{otherwise.} \end{cases}$$

$L_{gb}$ is the cost (profit) of the best solution from the beginning. This global
updating rule is intended to provide a greater amount of pheromone on the
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paths of the best solution, thus intensifying the search around this solution. The 
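transition probability to select the next item is given as:

$$prob_{ij}(t) = \begin{cases} \dfrac{\tau_{ij}\,\eta_{ij}}{\sum_{l \in allowed_k(t)} \tau_{il}\,\eta_{il}} & \text{if } j \in allowed_k(t), \\ 0 & \text{otherwise,} \end{cases} \qquad (4)$$

where $\tau_{ij}$ is the pheromone level on the arc (i, j), $\eta_{ij}$ is the heuristic and
$allowed_k(t)$ is the set of remaining feasible items. Thus the higher the value of
$\tau_{ij}$ and $\eta_{ij}$, the more profitable it is to include item j in the partial
solution.

Let $s_j = \sum_{i=1}^{m} r_{ij}$. For heuristic information we use:

$$\eta_{ij} = \begin{cases} p_j^{d_1} / s_j^{d_2} & \text{if } s_j \ne 0, \\ p_j^{d_1} & \text{if } s_j = 0, \end{cases} \qquad (5)$$

where $d_1 > 0$ and $d_2 > 0$ are parameters. Hence the objects with greater profit
and less average expenses will be more desirable.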
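The update rules and the selection step of this section can be sketched in C as follows. This is a minimal illustration under stated assumptions: the selection probability is taken to be proportional to the product of pheromone and heuristic value over the allowed items (a common random-proportional rule, used here because the exact form of (4) is not recoverable with certainty), and all function names are hypothetical.

```c
#include <stddef.h>

/* Local update: tau <- (1 - rho) * tau + rho * tau0. */
static double local_update(double tau, double rho, double tau0)
{
    return (1.0 - rho) * tau + rho * tau0;
}

/* Global update: tau <- (1 - rho) * tau + delta, where delta is nonzero
   only for arcs that belong to the best solution found so far. */
static double global_update(double tau, double rho, double delta)
{
    return (1.0 - rho) * tau + delta;
}

/* Fill prob[j] with tau[j] * eta[j] normalised over the allowed items and
   with 0 for the others. Returns 0 if no allowed item has positive weight. */
static int transition_probabilities(size_t n, const double *tau,
                                    const double *eta,
                                    const unsigned char *allowed, double *prob)
{
    double total = 0.0;
    for (size_t j = 0; j < n; ++j)
        if (allowed[j])
            total += tau[j] * eta[j];
    for (size_t j = 0; j < n; ++j)
        prob[j] = (allowed[j] && total > 0.0) ? tau[j] * eta[j] / total : 0.0;
    return total > 0.0;
}
```

An ant would then draw the next item at random according to `prob`, apply `local_update` to the arc it used, and, once all ants have finished, `global_update` would be applied to the arcs of the best solution.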

5 Experimental Results 

This section reports on the computational experience of the ACS coupled with
local search procedures using large MKP instances from the “OR-Library”,
available on the WWW at http://mscmga.ms.ic.ac.uk/jeb/orlib. To provide a fair
comparison for the above implemented ACS algorithm, a predefined number of
iterations, K = 500, is fixed for all the runs. The developed technique has been
coded in C++ and run on a Pentium 3 (450 MHz). Because of the random start of
the ants we can use fewer ants than the number of nodes (objects). After the
tests we found that 10 ants are enough to achieve good results. Thus we decrease
the running time of the program. The results in the figures show the advantage
of using local search procedures. The reported results are an average over 10
runs of the program. The chosen LS strategy appears to be very effective for
MKP. For all tested problems our algorithm achieved, in some of the runs, the
best results found in the literature. In all tested problems the results achieved
using the LS procedure are better than those without local search.

6 Conclusion 

In this paper a local search procedure has been developed. A comparison
of the performance of the ACS coupled with this procedure, applied to different
MKP problems, is reported. The results obtained are encouraging, and the
ability of the developed technique to rapidly generate high-quality solutions for
MKP can be seen. An important direction for future research is to try different
strategies to explore the search space more effectively and provide good results.
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Fig. 1. The figures show the average solution quality per iteration over 10 runs 
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Fig. 2. The figures show the average solution quality per iteration over 10 runs 
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Abstract. In this paper we describe and study importance separation 
Monte Carlo method for integral equations. Based on known results for
integrals, we extend this method to solving integral equations. The
method combines the idea of separation of the domain into uniformly
small subdomains (adaptive technique) with the Kahn approach of
importance sampling. We analyze the error and compare the results with
the crude Monte Carlo method.

1 Introduction 

An important advantage of Monte Carlo methods (MCMs) is that they allow one to
find directly the unknown functional of the solution of integral equations with a
number of operations necessary to calculate the solution of an integral equation
only at one point of the domain [11]. Much of the effort to improve MCMs goes
into the construction of variance reduction methods which speed up the
computation (see [2]). In [6,8,9] such a method for numerical integration, called
importance separation (IS), is presented and studied. This method has the best
possible rate of convergence for a certain class of functions (see [1]), but its
disadvantage is that it gives better accuracy only for low dimensions.

The problem of calculating a linear functional of the solution of an integral
equation can be transformed into the approximate evaluation of a finite number of
integrals (linear functionals of iterative functions) [6]. We apply the IS method
to each of the integrals and extend the study of its properties. We also study the
conditions under which the IS method has an advantage over other methods. We
compare these results with the Crude Monte Carlo method (CrMCM) for integrals
and the Monte Carlo method for integral equations (MCM-int.eq.) (see [11]).

2 Formulation of the Problem 

Consider the Fredholm integral equation of the second kind:

$$u(x) = \int_{\Omega} k(x, x')\,u(x')\,dx' + f(x),$$

or

$$u = \mathcal{K}u + f \quad (\mathcal{K} \text{ is an integral operator}),$$

where $k(x, x') \in L_2(\Omega \times \Omega)$, $f(x) \in L_2(\Omega)$ are given functions and
$u(x) \in L_2(\Omega)$ is an unknown function, $x, x' \in \Omega \subset \mathbb{R}^d$ ($\Omega$ is a bounded domain).

We are interested in Monte Carlo method for evaluation of linear functionals 
of the solution of the following type: 

J{u) = / ip{x)u{x)dx = {(p,u). (1) 

J n 

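It is assumed that $\varphi(x) \in L_2(\Omega)$. We can apply the method of successive
approximations to solve the integral equation:

$$u^{(i)} = \sum_{j=0}^{i} \mathcal{K}^{(j)} f = f + \mathcal{K}f + \dots + \mathcal{K}^{(i-1)}f + \mathcal{K}^{(i)}f, \quad i = 1, 2, \dots \qquad (2)$$

where $u^{(0)}(x) = f(x)$. It is known that the condition $\|\mathcal{K}\|_{L_2} < 1$ is a sufficient
condition for convergence of the Neumann series. Thus, when this condition is
satisfied, the following statement holds:

$$u^{(i)} \to u \quad \text{as } i \to \infty.$$

Therefore,

$$J(u) = (\varphi, u) = \lim_{i\to\infty} (\varphi, u^{(i)}) = \lim_{i\to\infty} \Big(\varphi, \sum_{j=0}^{i} \mathcal{K}^{(j)} f\Big) = \lim_{i\to\infty} \sum_{j=0}^{i} (\varphi, \mathcal{K}^{(j)} f).$$

An approximation of the unknown value $(\varphi, u)$ can be obtained using a truncated
Neumann series (2) for sufficiently large $i$:

$$(\varphi, u^{(i)}) = (\varphi, f) + (\varphi, \mathcal{K}f) + \dots + (\varphi, \mathcal{K}^{(i)}f).$$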

So, we transform the problem of solving integral equations into the problem of
approximate evaluation of a finite number of multidimensional integrals. We will
use the notation $(\varphi, \mathcal{K}^{(j)}f) = I(j)$, where $I(j)$ is a value obtained
after integration over $\Omega^{j+1} = \Omega \times \dots \times \Omega$, $j = 0, \dots, i$. It is obvious that
the calculation of the estimate $(\varphi, u^{(i)})$ can be replaced by the evaluation of a sum
of linear functionals of iterative functions of the type $(\varphi, \mathcal{K}^{(j)}f)$,
$j = 0, \dots, i$, which can be presented as:

$$(\varphi, \mathcal{K}^{(j)}f) = \int_\Omega \varphi(t_0)\,\mathcal{K}^{(j)}f(t_0)\,dt_0 = \int_G \varphi(t_0)\,k(t_0,t_1) \dots k(t_{j-1},t_j)\,f(t_j)\,dt_0 \dots dt_j, \qquad (3)$$

where $t = (t_0, \dots, t_j) \in G = \Omega^{j+1} \subset \mathbb{R}^{d(j+1)}$. If we denote by $F(t)$ the integrand
function

$$F(t) = \varphi(t_0)\,k(t_0,t_1) \dots k(t_{j-1},t_j)\,f(t_j), \quad t \in \Omega^{j+1},$$

then we obtain the following expression for (3):

$$I(j) = (\varphi, \mathcal{K}^{(j)}f) = \int_G F(t)\,dt, \quad t \in G \subset \mathbb{R}^{d(j+1)}. \qquad (4)$$

So, from now on we will consider the problem of approximate calculation of
multiple integrals of the type (4). We will first briefly review the most widely
used Monte Carlo methods for integrals and integral equations. It is well known
that Monte Carlo methods reduce the problem to the approximate calculation of a
mathematical expectation which coincides with the unknown functional defined
by (1).

3 Monte Carlo Methods 

3.1 Crude Monte Carlo Method for Integrals 

The Monte Carlo quadrature formula is based on the probabilistic interpretation
of an integral. If $\{x_m\}$, $m = 1, \dots, N$, is a uniformly distributed sequence in $G$,
then the Monte Carlo approximation to the integral $I(j)$ is

$$I(j) \approx I_N(j) = \frac{V(G)}{N} \sum_{m=1}^{N} F(x_m)$$

with integration error $\varepsilon_N(j) = |I(j) - I_N(j)| \approx \sqrt{\mathrm{Var}(F)}\,N^{-1/2}$, where $V(G)$ is the
volume of $G$.
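
As a minimal sketch of the crude Monte Carlo quadrature described above, the following Python fragment estimates an integral over a box by averaging the integrand at uniformly distributed points and scaling by the volume. The function name `crude_mc`, the test integrand and the box are illustrative assumptions, not taken from the paper; the reported standard error is the usual sample-based estimate of order $N^{-1/2}$.

```python
import math
import random


def crude_mc(F, bounds, n_samples, rng):
    """Crude Monte Carlo estimate of the integral of F over a box.

    bounds is a list of (low, high) pairs, one per coordinate.
    Returns the estimate and a sample-based standard error.
    """
    volume = math.prod(high - low for low, high in bounds)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        point = [rng.uniform(low, high) for low, high in bounds]
        value = F(point)
        total += value
        total_sq += value * value
    mean = total / n_samples
    variance = max(total_sq / n_samples - mean * mean, 0.0)
    return volume * mean, volume * math.sqrt(variance / n_samples)


# Illustrative integrand: F(t) = t_0 * t_1 on [0, 1] x [0, 2]; the exact integral is 1.
rng = random.Random(2003)
estimate, std_error = crude_mc(lambda t: t[0] * t[1], [(0.0, 1.0), (0.0, 2.0)], 20000, rng)
```

Quadrupling `n_samples` should roughly halve `std_error`, which is the $N^{-1/2}$ behaviour that the variance reduction methods of this paper try to improve on.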



3.2 Monte Carlo Method for Integral Equations 

It is known (see [11]) that the approximate calculation of linear functionals of
the solution of an integral equation $(\varphi, u)$ reduces to the calculation of a finite
sum of linear functionals of iterative functions $(\varphi, \mathcal{K}^{(j)}f)$, $j = 0, \dots, i$. First, in $\Omega$
we construct a random trajectory (Markov chain) $T_i$ of length $i$ starting in state
$x_0$:

$$T_i : x_0 \to x_1 \to \dots \to x_i$$

according to the initial $\pi(x)$ and transition $p(x,x')$ probabilities. The functions
$\pi(x)$ and $p(x,x')$ are non-negative, acceptable to the function $\varphi(x)$ and the
kernel $k(x,x')$ respectively, and satisfy $\int_\Omega \pi(x)\,dx = 1$,
$\int_\Omega p(x,x')\,dx' = 1$ for any $x \in \Omega \subset \mathbb{R}^d$. From the above supposition on the
kernel $k(x,x')$ and the well-known fact that

$$E\theta_i[\varphi] = (\varphi, u^{(i)}), \quad \text{where} \quad \theta_i[\varphi] = \frac{\varphi(x_0)}{\pi(x_0)} \sum_{j=0}^{i} W_j f(x_j)$$

and

$$W_0 = 1, \quad W_j = W_{j-1}\,\frac{k(x_{j-1},x_j)}{p(x_{j-1},x_j)}, \quad j = 1, \dots, i,$$

it follows that the corresponding Monte Carlo estimate of $(\varphi, u^{(i)})$ is:

$$(\varphi, u^{(i)}) \approx \frac{1}{N} \sum_{n=1}^{N} \theta_i[\varphi]_n.$$

Therefore, the random variable $\theta_i[\varphi]$ can be considered as an estimate of the
desired value $(\varphi, u)$ for $i$ sufficiently large, with a statistical error of order $O(N^{-1/2})$,
where $N$ is the number of chains and $\theta_i[\varphi]_n$ is the value of $\theta_i[\varphi]$ taken over the
$n$-th chain.

It is very important that the same trajectories of the type $T_i$ can be used
for the approximate evaluation of $(\varphi, u^{(i)})$ for various functions $\varphi(x)$. Furthermore,
they can be used for various integral equations with the same kernel $k(x,x')$,
but with different right-hand side $f(x)$.
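
The Markov chain estimator above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the initial density $\pi$ and the transition density $p$ are both taken uniform on $\Omega = [a, b]$, the function names are hypothetical, and the test problem ($k(x,x') = x x'$, $f(x) = x$, $\varphi(x) = 1$ on $[0,1]$) is chosen here because its solution $u(x) = 1.5x$ is known in closed form; it is not the test example of the paper.

```python
import random


def chain_estimate(phi, kernel, f, a, b, length, rng):
    """One realization of theta_i[phi] on Omega = [a, b].

    Assumption: both the initial density pi(x) and the transition density
    p(x, x') are uniform on [a, b].
    """
    density = 1.0 / (b - a)
    x = rng.uniform(a, b)
    start_factor = phi(x) / density      # phi(x_0) / pi(x_0)
    weight = 1.0                         # W_0 = 1
    total = f(x)                         # term W_0 * f(x_0)
    for _ in range(length):
        x_next = rng.uniform(a, b)
        weight *= kernel(x, x_next) / density   # W_j = W_{j-1} * k / p
        total += weight * f(x_next)
        x = x_next
    return start_factor * total


def mc_functional(phi, kernel, f, a, b, length, n_chains, rng):
    """Average of theta_i[phi] over n_chains independent chains."""
    total = 0.0
    for _ in range(n_chains):
        total += chain_estimate(phi, kernel, f, a, b, length, rng)
    return total / n_chains


# Hypothetical test problem on [0, 1]: k(x, x') = x * x', f(x) = x, phi(x) = 1.
# Its solution is u(x) = 1.5 * x, so (phi, u) = 0.75.
rng = random.Random(1)
estimate = mc_functional(lambda x: 1.0, lambda x, y: x * y, lambda x: x,
                         0.0, 1.0, 8, 20000, rng)
```

Because the chain does not depend on `phi` or `f`, the same trajectories could be reused for other functionals or right-hand sides, as noted in the text.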



3.3 Importance Separation for Integrals 

Importance separation is a Monte Carlo method which combines the idea of
separating the domain of integration into uniformly small subdomains
(stratification, [5]) with the Kahn approach of placing more samples in those
subdomains where the integrand is large (importance sampling for integrals, [7], and
for integral equations, [4, 10]). This method has the best rate of convergence for
the class of functions with bounded derivatives.

One approach to partitioning the given domain into subdomains was studied
in [9], where the problem of evaluating the integral $I(j) = \int_G F(t)\,dt$ is
considered. The partition scheme suggested there for the domain
$G = [a; b]$ into $M$ subintervals (one-dimensional case) is the following:

$$G = \bigcup_{i=1}^{M} G_i, \quad G_i = [x_{i-1}, x_i], \quad i = 1, \dots, M-1,$$

$$C_i = \frac{1}{2}\,[F(x_{i-1}) + F(x_M)]\,(x_M - x_{i-1}), \quad i = 1, \dots, M-1,$$

$$x_i = x_{i-1} + \frac{C_i}{F(x_{i-1})\,(M - i + 1)}, \quad x_0 = a, \quad x_M = b.$$

It is known (see [11]) that

$$E\theta_N(j) = I(j),$$

where

$$\theta_N(j) = \sum_{i=1}^{M} \frac{V(G_i)}{N_i} \sum_{l=1}^{N_i} F(\xi_l^{(i)}) \qquad (5)$$

and $\xi_l^{(i)}$ is a random point in the $i$-th subdomain of $G$.

In the general case of multidimensional integrals ($G \subset \mathbb{R}^n$) the following
integration error (the probable error) holds [9]:



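$$r_N \le \sqrt{2}\,n \left( \frac{1}{N} \sum_{i=1}^{N} (L_i c_{1i} c_{2i})^2 \right)^{1/2} N^{-\frac{1}{2}-\frac{1}{n}}, \quad M = N, \qquad (6)$$

where $n$ is the dimension of the domain of integration, $M$ is the number of
subdomains, and the integrand is a positive function $F(t)$ which belongs to $W^1(L; G)$.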

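This means that $F(t)$ is continuous on $G$ with partially continuous first
derivatives and

$$\left|\frac{\partial F}{\partial t_l}\right| \le L_{il}, \quad l = 1, \dots, n, \quad t \in G_i, \quad L_i = (L_{i1}, \dots, L_{in}), \quad L_{il} = \max_{t \in G_i}\left|\frac{\partial F}{\partial t_l}\right|.$$

The constants $c_{1i}$ ($i = 1, \dots, M$) and the vectors of constants $c_{2i} \in \mathbb{R}^n$ are
determined from the requirement that the subdomains $G_i$, $i = 1, \dots, M$, have to be
uniformly small in probability and in geometrical size, and it is also assumed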
that C 2 = max C 2 ■ . 

' i 

From (6) it is clear that the error of the importance separation method, which
has the order $O(N^{-1/2-1/n})$, asymptotically goes to $O(N^{-1/2})$ for large
dimensions $n$. This estimate of the integration error shows that importance separation
can be considered a good method for the approximate calculation of integrals only
if $n$ is not very large. Translated into the terms of integral equations, this
means that the Neumann series has to converge quickly.



4 Description of the Algorithm 

In the application of importance separation to multidimensional integrals
various approaches can be used. If $G = [a; b]^n \subset \mathbb{R}^n$, we use the following simplified
scheme:

- the number of subintervals and the points of partition along all axis directions
are the same;

- all components of any point of the partition are identical, because we use
$x_0 = (a, \dots, a) \in G$ and the formula

$$(x_i)_l = (x_{i-1})_l + \frac{C_i}{F(x_{i-1})\,(M - i + 1)}, \quad l = 1, \dots, n,$$

for each of the components of the next point $x_i \in G$.



An essential step in realizing importance separation is the partition
into subdomains. The idea that the subdomains should be of identical measure
(volume) is used as the basis for obtaining the points of partition. We will
demonstrate some requirements imposed on the integrand in the one-dimensional case.
The first and the last point of the partition are fixed: they coincide with the
bounds of the considered domain (interval). The following equation is used to
obtain the point $x_i$, $i = 1, \dots, M-1$:

$$\int_{x_{i-1}}^{x_M} F(t)\,dt = (M - i + 1) \int_{x_{i-1}}^{x_i} F(t)\,dt.$$

The trapezoid rule is applied to the integral on the left-hand side of the above
equation and the left rectangle rule to the integral on the right-hand side. This
approach to partitioning cannot be used for functions which vanish in some regions,
because it uses the value of the integrand at the left endpoint of the interval.
On the other hand, the formula (5) will be applicable if the following inequality
is fulfilled:

$$\frac{C_i}{F(x_{i-1})\,(M - i + 1)} < x_M - x_{i-1}.$$

This inequality can be transformed to a condition for the number of
subintervals $M$:

$$M > i - \frac{1}{2} + \frac{F(x_M)}{2F(x_{i-1})}. \qquad (7)$$



It is obvious that $M$ will become too large when $F(x_{i-1})$ is close to zero for
some $i \in (1, M-1)$. As mentioned before, the importance separation
method is based on the idea of generating more samples in the regions where the
integrand has larger values. So, IS is appropriate for integrands that have regions
of different significance. On the other hand, the formula (7) shows that if the
ratio of the integrand values at the right endpoint and at some interior point
of the interval is large, $M$ will increase too. For example, if in the multidimensional
case the number of subintervals in each axis direction is 10 and $n = 6$ (the
dimension of the domain of integration), one has to perform $10^6$ realizations of
the random variable. The method cannot be realized with fewer samples because
it is necessary to pick at least one point in every subdomain.
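
The one-dimensional partition rule and the resulting estimator with one random point per subinterval can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the integrand $F(x) = 1 + x$ on $[0, 1]$ is chosen here because it is strictly positive (so the left-endpoint value never vanishes), and each subinterval contributes its length times the integrand at one uniformly chosen point.

```python
import random


def partition_points(F, a, b, M):
    """Partition points x_0 = a < x_1 < ... < x_M = b.

    Each interior point follows x_i = x_{i-1} + C_i / (F(x_{i-1}) * (M - i + 1)),
    with C_i the trapezoid approximation of the integral of F over [x_{i-1}, b].
    """
    points = [a]
    for i in range(1, M):
        prev = points[-1]
        c_i = 0.5 * (F(prev) + F(b)) * (b - prev)
        points.append(prev + c_i / (F(prev) * (M - i + 1)))
    points.append(b)
    return points


def importance_separation(F, a, b, M, rng):
    """Estimate the integral of F over [a, b] with one sample per subinterval."""
    pts = partition_points(F, a, b, M)
    return sum((hi - lo) * F(rng.uniform(lo, hi)) for lo, hi in zip(pts, pts[1:]))


# Illustrative integrand F(x) = 1 + x on [0, 1]; the exact integral is 1.5.
integrand = lambda x: 1.0 + x
pts = partition_points(integrand, 0.0, 1.0, 50)
estimate = importance_separation(integrand, 0.0, 1.0, 50, random.Random(3))
```

In $n$ dimensions with the same $M$ points on every axis the number of subdomains, and hence the minimum number of samples, grows as $M^n$, which is the $10^6$ figure quoted above for $M = 10$, $n = 6$.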



5 Error Analysis 

We consider the problem of approximate calculation of a linear functional $(\varphi, u)$
of the solution of an integral equation. Let us denote the $i$-th iterative
approximation of $u$ ($i > 0$) by $u^{(i)}$ and the Monte Carlo approximation by $\tilde{u}$. It is known
that two kinds of error, systematic (a truncation error) and stochastic (a
probabilistic one), are characteristic of iterative Monte Carlo methods. If $\varepsilon$ is a given
sufficiently small positive parameter, then

$$|(\varphi, \tilde{u}) - (\varphi, u)| \le |(\varphi, u) - (\varphi, u^{(i)})| + |(\varphi, u^{(i)}) - (\varphi, \tilde{u})| = \varepsilon_i + \varepsilon_N < \varepsilon,$$

where $\varepsilon_i$ is the truncation error and $\varepsilon_N$ is the probabilistic error. We can obtain a
lower bound for the number $i$ of iterations using [3]:

where $u^{(0)}$ is the initial approximation of $u$. Using the Cauchy-Bunyakowski
inequality it is easy to show that



The second multiplier can be estimated: 

< (I|/||l. + \\JC\\l. + • ■ ■ + mi, + • ■ .)lk(°)|U.(r2), 

where = f — — K.u^^\ j = 0, 1, . . . . 

Taking the limit for i — > oo one can obtain 



- m||l, < 



|t.(o) I 



L2(r3) 



1-ii^iu. ■ 

It holds if the condition $\|\mathcal{K}\|_{L_2} < 1$ is fulfilled.



6 Numerical Experiments 

We present the numerical results (accuracy and CPU-time) for the considered
algorithm, importance separation, applied to calculate a linear functional of the
solution of an integral equation. We use the following integral equation as a test
example:

$$u(x) = \int_\Omega k(x,x')\,u(x')\,dx' + f(x), \quad \text{where}$$

$$k(x,x') = \frac{0.055}{1 + e^{-3x}} + 0.07, \quad (\|\mathcal{K}\|_{L_2} \approx 0.2) \qquad (8)$$

$$f(x) = 0.02\,(3x^2 + e^{-0.35x}), \quad \Omega = [-2; 2].$$

This kind of equation describes some neural network procedures. We are
interested in an approximate calculation of $(\varphi, u)$, where
$\varphi(x) = 0.7((x+1)^2\cos(5x) + 20)$. The results are presented as a function of the number
of iterations (length of the Markov chain) and $N$, the number of samples (only one
sample of $N$ random points is used for the presented results). The importance
separation algorithm is constructed so that only one sample of the random variable
is chosen in every subdomain. The number of iterations $i$ is fixed, but it has been
chosen in advance according to the $L_2$-norm of the kernel (8). For the approximate
computation of any integral $I(j)$, $j = 0, \dots, i$, a different number of samples is
used in order to balance the errors. The results for importance separation are
compared with CrMCM and MCM-int.eq., and they are presented in Table 1 and Table 2.
It is known that importance separation is a variance reduction method. Table 2
illustrates this property by the fact that the importance separation method
leads to a smaller relative error than CrMCM for a fixed number of
samples. From the estimate (6) it is seen that the acceleration of the
convergence for large $d$ is insignificant. The CPU-time (see Table 2) for importance

Table 1. Absolute and relative error in the approximate calculation of $(\varphi, u)$ using
crude MCM for integrals, MCM for integral equations and importance separation for a
fixed number of iterations $i$.



                  Number of samples N_j            CrMCM              IS              MCM-int.eq.
 i   eps_i    N_0   N_1   N_2   N_3   N_4   N_5    Abs.     Rel.      Abs.     Rel.       N      Abs.     Rel.
 1   1.3907    10   8^2    -     -     -     -    2.8315   0.3151    1.2821   0.1427     360    1.2995   0.1446
 2   0.5424    10   8^2   5^3    -     -     -    1.8733   0.2085    0.5489   0.0611     540    0.5537   0.0616
 3   0.2115   100   8^2   5^3   8^4    -     -    0.2180   0.0243    0.0497   0.0055    4096    0.3072   0.0342
 4   0.0825    10   8^2   5^3   8^4   4^5    -    1.4077   0.1567    0.0717   0.0079   11600    0.0832   0.0093
 5   0.0322   100   kF    1(F   1(F   5^5   6^6   0.1282   0.0843    0.0119   0.0013   37000    0.0329   0.0037

Table 2. Relative error and CPU-time (in seconds) in the approximate calculation
of $(\varphi, u)$ using crude MCM for integrals, MCM for integral equations and
importance separation. The number of samples used for the calculation of each of the
integrals $I(j)$, $j = 0, \dots, 5$, is denoted by $N_j$.

        Number of samples N_j                  CrMCM                IS                 MCM-int.eq.
 N_0   N_1   N_2    N_3   N_4    N_5    Rel. error   Time    Rel. error   Time       N     Rel. error     Time
  10   8^2   6^3    s'*   4^5    6^6      0.1553     0.39      0.0144     0.39      100      0.0363       0.02
 100   W     rF     rF    5^5    6^6      0.0843     0.44      0.0013     0.45     1000      0.0144       0.16
 100   w     18^3   s'*   5^5    6^6      0.0343     0.43      0.0048     0.43    10000      0.0084       1.55
 100   w     w      !¥           6^6      0.0279     2.12      0.0069     2.12    25000      0.0036       3.87
 100   w     TcF    rF    1¥     6^6      0.0228     2.04      0.0014     2.06     6^6       0.0025       7.17
 500   w     W            12^5   7^6      0.0019     3.75      0.0049     3.81     7^6       0.0041      18.47
 100   w     18^3   lU    12^5   S*'      0.0233     2.67      0.0033     2.71     8^6       0.0032      39.91
 100   w     rF     rF    5^5    rF       0.0142     7.84      0.0012     8.14               0.0034     145.93

separation slightly exceeds the CPU-time for crude MCM because of the
additional calculations necessary for the partition, although the number
of samples is the same for both methods. The CPU-time of MCM-int.eq. is
several times larger than the corresponding CPU-time of the IS method. The
reason is the use of the acceptance-rejection method for modeling the
initial probability density (in the general case MCM-int.eq. needs modeling with
transition probability densities too). However, when the inverse function method
is used for simulation, the CPU-time of IS can exceed the CPU-time of
MCM-int.eq. several times. The reason is that one has to use a larger number of calls of
the pseudorandom generator for the IS method (the random points belong to $\mathbb{R}^{d(j+1)}$)
than for MCM-int.eq. (the random points belong to $\mathbb{R}^d$).



7 Conclusion 

Importance separation gives better accuracy than CrMCM and MCM-int.eq.
when the Neumann series converges rapidly, i.e. when the number of
iterations $i$ and the dimension $n = d(j+1)$, $j = 0, \dots, i$, of the integrals $I(j)$ are not
very large. CrMCM can have priority in accuracy over IS when the dimension
increases. This can be explained by the fact that for large dimension $n$ the error
of the IS method asymptotically goes to $O(N^{-1/2})$, which corresponds to the error
of the CrMC method.

Abstract. We study a parallel Monte Carlo (MC) method for the
investigation of a quantum kinetic equation which accounts for the action
of the electric field during the process of electron-phonon interaction.
Optimization of the presented parallel algorithm is done using variance
reduction techniques and parallel random sequences from the Scalable
Parallel Random Number Generator (SPRNG) library. The developed
code written in C is parallelized with MPI and OpenMP codes.
Numerical results for the parallel efficiency of the algorithm are obtained.
The dependence of the electron energy distribution on the applied electric
field is investigated for long evolution times. The distribution function is
computed for GaAs material parameters.


1 Introduction 

The development and application of the MC methods for quantum transport 
in semiconductors and semiconductor devices has been initiated during the last 
decade [1,2]. The stochastic approach relies on the numerical MC theory as 
applied to the integral form of the generalized electron-phonon Wigner equation. 
An equation for reduced electron-phonon Wigner function which accounts for 
electron-phonon interaction has been recently derived [3]. 

For a bulk semiconductor with an applied electric field the equation
resembles the Levinson equation [4], or equivalently the Barker-Ferry (B-F) equation
[5] with infinite electron lifetime. A crude MC method has been proposed to
find quantum solutions of the B-F equation at zero temperature for evolution
times up to 200 femtoseconds (fs) [6]. It is proved [7] that the stochastic error has
order $O(\exp(ct)/N^{1/2})$, where $t$ is the evolution time, $N$ is the number of samples of
the MC estimator, and $c$ is a constant depending on the kernel of the quantum
kinetic equation. This estimate shows that when $t$ is fixed and $N \to \infty$ the error
decreases, but for $N$ fixed and $t$ large the factor for the error looks ominous.
Therefore, the problem of estimating the electron energy distribution function
for long evolution times with small stochastic error requires combining both MC
variance reduction techniques and distributed or parallel computations.
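
To make the growth of the required sample size concrete, the error bound $\exp(ct)/N^{1/2}$ can be inverted for $N$. The sketch below assumes that the constant hidden in the $O(\cdot)$ bound equals 1 and uses purely hypothetical values of $c$ and of the tolerance; only the exponential dependence on $t$ is the point.

```python
import math


def samples_needed(c, t, tolerance):
    """Smallest N with exp(c * t) / sqrt(N) <= tolerance.

    Assumption: the constant hidden in the O(.) bound is taken to be 1.
    """
    return math.ceil(math.exp(2.0 * c * t) / tolerance ** 2)


# Hypothetical values: c = 0.01 per fs, tolerance 0.5.
n_200 = samples_needed(0.01, 200.0, 0.5)
n_300 = samples_needed(0.01, 300.0, 0.5)
```

With these assumed values, going from 200 fs to 300 fs multiplies the required number of samples by about $e^{2} \approx 7.4$, which is why longer evolution times call for both variance reduction and parallel computation.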

* Supported by the European Commission through grant number HPRI-CT-1999-00026
and by Center of Excellence BIS-21 grant ICA1-2000-70016, as well as by the
NSF of Bulgaria through grant number I-1201/02.

I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 153-161, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

In this paper a parallel MC method for solving the B-F equation is studied.
The transition density function for the Markov chain is chosen to be
proportional to the contribution from the kernels. The integral parts of the kernels
are estimated using MC integration. An importance sampling technique is
introduced to reduce the variance in the MC quadrature. A new rule for sampling
the transition density function of the Markov chain is used to construct the MC
estimator. In this way, we avoid the acceptance-rejection techniques used in [6].
The parallelisation of the MC algorithm is done using MPI and OpenMP codes.
All these improvements in the MC approach lead to a decrease of the
computational complexity of the algorithm and make it possible to estimate the electron
energy distribution function for long evolution times.



2 The Quantum Kinetic Equation 



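The quantum kinetic equation accounting for the electron-phonon interaction in
the presence of an applied electric field can be written in the following integral
form [6]:

$$f(\mathbf{k},t) = \phi(\mathbf{k}) + \int_0^t dt'' \int_G d\mathbf{k}'\, K(\mathbf{k},\mathbf{k}') \times \qquad (1)$$

$$\left[ \int_{t''}^{t} dt'\, S_1(\mathbf{k},\mathbf{k}',\mathbf{F},t',t'')\, f(\mathbf{k}',t'') + \int_{t''}^{t} dt'\, S_2(\mathbf{k},\mathbf{k}',\mathbf{F},t',t'')\, f(\mathbf{k},t'') \right],$$

where the kernel is separated in two terms:

$$K(\mathbf{k},\mathbf{k}') = \frac{2V}{(2\pi)^3\hbar^2}\,|g(\mathbf{q})|^2, \quad \text{and} \qquad (2)$$

$$S_1(\mathbf{k},\mathbf{k}',\mathbf{F},t',t'') = -S_2(\mathbf{k}',\mathbf{k},\mathbf{F},t',t'') = \exp(-\Gamma(t'-t'')) \times \qquad (3)$$

$$\left\{ (n_\mathbf{q}+1) \cos\left( \frac{\epsilon(\mathbf{k}) - \epsilon(\mathbf{k}') + \hbar\omega_\mathbf{q}}{\hbar}\,(t'-t'') - \frac{\hbar}{2m}(\mathbf{k}'-\mathbf{k})\cdot\mathbf{F}\,(t'^2 - t''^2) \right) \right.$$

$$\left. +\; n_\mathbf{q} \cos\left( \frac{\epsilon(\mathbf{k}) - \epsilon(\mathbf{k}') - \hbar\omega_\mathbf{q}}{\hbar}\,(t'-t'') - \frac{\hbar}{2m}(\mathbf{k}'-\mathbf{k})\cdot\mathbf{F}\,(t'^2 - t''^2) \right) \right\}.$$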

Here, $\mathbf{k}$ and $t$ are the momentum and the evolution time, respectively, $f(\mathbf{k},t)$
is the distribution function, and $\phi(\mathbf{k})$ is the initial electron distribution function.
$\mathbf{F} = e\mathbf{E}/\hbar$, where $\mathbf{E}$ is the applied electric field.
$n_\mathbf{q} = 1/(\exp(\hbar\omega_\mathbf{q}/\mathcal{K}T) - 1)$ is
the Bose function, where $\mathcal{K}$ is the Boltzmann constant and $T$ is the temperature
of the crystal; it corresponds to an equilibrium distributed phonon bath. $\hbar\omega_\mathbf{q}$ is the
phonon energy, which generally depends on $\mathbf{q} = \mathbf{k}' - \mathbf{k}$, and
$\epsilon(\mathbf{k}) = (\hbar^2 k^2)/2m$ is the electron energy. A Fröhlich coupling is considered:

$$g(\mathbf{q}) = -i \left[ \frac{2\pi e^2 \hbar\omega_\mathbf{q}}{V} \left( \frac{1}{\varepsilon_\infty} - \frac{1}{\varepsilon_s} \right) \frac{1}{\mathbf{q}^2} \right]^{1/2},$$

where $\varepsilon_\infty$ and $\varepsilon_s$ are the optical and static dielectric constants. The
damping factor $\Gamma$ is considered independent of the electron states $\mathbf{k}$ and $\mathbf{k}'$. This is
reasonable since $\Gamma$ weakly depends on $\mathbf{k}$ and $\mathbf{k}'$ for states in the energy region
above the phonon threshold, where the majority of the electrons reside due to
the action of the electric field.

3 Monte Carlo Approach 

Consider the problem of evaluating the following functional

$$J_h(f) = (h, f) = \int_0^T \int_G h(\mathbf{k},t)\,f(\mathbf{k},t)\,d^3\mathbf{k}\,dt$$

by a MC method. Suppose the distribution function $f(\mathbf{k},t)$ and the arbitrary
function $h(\mathbf{k},t)$ belong to a Banach space $X$ and the adjoint space $X^*$,
respectively. The wave vector $\mathbf{k}$ belongs to a finite domain $G$ and $t \in (0,T)$.
The value of $f$ at a fixed point $(\mathbf{k}_0,t_0)$ is provided by the special case
$h(\mathbf{k},t) = \delta(\mathbf{k}-\mathbf{k}_0)\,\delta(t-t_0)$. Since the Neumann series of the integral equation (1)
converges [7], the solution $f(\mathbf{k}_0,t_0)$ can be evaluated by a MC method.

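Define a terminated Markov chain
$(\mathbf{k}_0,t_0) \to \dots \to (\mathbf{k}_j,t_j) \to \dots \to (\mathbf{k}_{m_{\varepsilon_1}}, t_{m_{\varepsilon_1}})$,
such that $(\mathbf{k}_j,t_j) \in G \times (0,T)$ as $t_j \in (0,t_{j-1})$, $j = 1, 2, \dots, m_{\varepsilon_1}$ ($\varepsilon_1$
is the truncation parameter). All points are sampled using an arbitrary
transition density function $p(\mathbf{k},\mathbf{k}',t,t'')$ which is tolerant¹ of both kernels in equation
(1). The biased backward MC estimator for the solution of (1) at the fixed point
$(\mathbf{k}_0,t_0)$ has the following form:

$$\xi_{m_{\varepsilon_1}}[\mathbf{k}_0,t_0] = \phi(\mathbf{k}_0) + \sum_{j=1}^{m_{\varepsilon_1}} W_j^\alpha\,\phi(\mathbf{k}_j^\alpha), \qquad (4)$$

where

$$\phi(\mathbf{k}_j^\alpha) = \begin{cases} \phi(\mathbf{k}_j), & \text{if } \alpha = 1, \\ \phi(\mathbf{k}_{j-1}), & \text{if } \alpha = 2, \end{cases}$$

$$W_j^\alpha = W_{j-1}^\alpha\,\frac{K(\mathbf{k}_{j-1},\mathbf{k}_j)\,\nu_\alpha(\mathbf{k}_{j-1},\mathbf{k}_j,t_{j-1},t_j)}{p_\alpha\,p(\mathbf{k}_{j-1},\mathbf{k}_j,t_{j-1},t_j)}, \quad j = 1, \dots, m_{\varepsilon_1}, \quad W_0^\alpha = 1.$$

Here $\nu_\alpha(\mathbf{k}_{j-1},\mathbf{k}_j,t_{j-1},t_j)$ is a MC estimator for
$\int_{t_j}^{t_{j-1}} dt'\, S_\alpha(\mathbf{k}_{j-1},\mathbf{k}_j,\mathbf{F},t',t_j)$
in the $j$-th transition of the Markov chain. The probabilities $p_\alpha$ ($\alpha = 1, 2$) are
chosen to be proportional to the absolute values of the kernels.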

Now we can define a Monte Carlo method

$$\frac{1}{N} \sum_{i=1}^{N} \left(\xi_{m_{\varepsilon_1}}[\mathbf{k}_0,t_0]\right)_i \xrightarrow{P} J_h(f_{m_{\varepsilon_1}}) \approx f(\mathbf{k}_0,t_0), \qquad (5)$$

where $\xrightarrow{P}$ means stochastic convergence as $N \to \infty$; $f_{m_{\varepsilon_1}}$ is the iterative solution
obtained by the Neumann series of (1). The relation (5) still does not determine
the computational algorithm. The sampling rule, which computes the next point
of the Markov chain, has to be specified by using random number generators.

¹ $p(x)$ is tolerant of $g(x)$ if $p(x) > 0$ when $g(x) \ne 0$ and $p(x) \ge 0$ when $g(x) = 0$.

In order to avoid the singularity in (2) the following transition density
function is suggested: $p(\mathbf{k},\mathbf{k}',t,t'') = p(\mathbf{k}'/\mathbf{k})\,p(t,t'')$, where $p(t,t'') = 1/t$. In
spherical coordinates the function $p(\mathbf{k}'/\mathbf{k})$ is chosen in the following way:
$p(\mathbf{k}'/\mathbf{k}) = (4\pi)^{-1}(\rho)^{-2}\,l(\omega)^{-1}$, where $\omega = (\mathbf{k}'-\mathbf{k})/\rho$, $\rho = |\mathbf{k}'-\mathbf{k}|$ and $l(\omega)$
is the distance in the direction of the unit vector $\omega$ from $\mathbf{k}$ to the boundary of the
domain $G$. If $G$ is a sphere with radius $Q$, the function $p(\mathbf{k}'/\mathbf{k})$ satisfies the
condition for a transition density. Indeed,

$$\int_G p(\mathbf{k}'/\mathbf{k})\,d^3\mathbf{k}' = \int (4\pi)^{-1}\,d\omega \int_0^{l(\omega)} \rho^2 \rho^{-2}\,l(\omega)^{-1}\,d\rho = 1.$$

Suppose the direction of the field $\mathbf{E}$ is parallel to the $k_z$-axis. Then the unit vector
$\omega$ can be sampled in the $xz$-plane (or $\sin\varphi = 0$) because of the symmetry of the
task around the direction of the field.

Thus, if we know the wave vector $\mathbf{k}$, the next state $\mathbf{k}'$ can be computed by
the following sampling rule:

Algorithm 1:

1. Sample a random unit vector $\omega = (\sin\theta, 0, \cos\theta)$ as $\sin\theta = \sin 2\pi\beta_1$ and
$\cos\theta = \cos 2\pi\beta_1$, where $\beta_1$ is a uniformly distributed number in $(0, 1)$;

2. Calculate $l(\omega) = -\omega\cdot\mathbf{k} + (Q^2 + (\omega\cdot\mathbf{k})^2 - \mathbf{k}^2)^{1/2}$, where $\omega\cdot\mathbf{k}$ means a scalar
product between two vectors;

3. Sample $\rho = l(\omega)\beta_2$, where $\beta_2$ is a uniformly distributed number in $(0, 1)$;

4. Calculate $\mathbf{k}' = \mathbf{k} + \rho\,\omega$.
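
The four steps of Algorithm 1 translate almost line by line into code. The following Python sketch is an illustration only; the function name `sample_next_state` and the numerical values in the usage lines are assumptions made here, and wave vectors are represented as plain 3-tuples.

```python
import math
import random


def sample_next_state(k, Q, rng):
    """Sample k' inside the sphere |k'| <= Q following Algorithm 1.

    k is a 3-tuple lying inside the sphere of radius Q.
    """
    # Step 1: unit vector in the xz-plane with a uniformly distributed angle.
    theta = 2.0 * math.pi * rng.random()
    omega = (math.sin(theta), 0.0, math.cos(theta))
    # Step 2: distance from k to the boundary of the sphere along omega.
    omega_dot_k = sum(w * c for w, c in zip(omega, k))
    k_squared = sum(c * c for c in k)
    distance = -omega_dot_k + math.sqrt(Q * Q + omega_dot_k ** 2 - k_squared)
    # Step 3: uniformly distributed length along that direction.
    rho = distance * rng.random()
    # Step 4: the next state.
    return tuple(c + rho * w for c, w in zip(k, omega))


# Hypothetical starting point on the z-axis, in the same units as Q.
rng = random.Random(7)
k_next = sample_next_state((0.0, 0.0, 10.0), 66.0, rng)
```

Since $|\mathbf{k} + l(\omega)\,\omega| = Q$ by construction and $\rho \le l(\omega)$, every sampled state stays inside the integration domain, so no rejection step is needed.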

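In order to evaluate $f(\mathbf{k},t)$ at the fixed point $(\mathbf{k}_0,t_0)$, $N$ random walks of the
MC estimator can be computed by the following algorithm:

Algorithm 2:

1. Choose a positive small number $\varepsilon_1$ and set initial values $\mathbf{k} := \mathbf{k}_0$, $t := t_0$,
$\xi := \phi(\mathbf{k})$, $W := 1$;

2. Sample a value $t''$ with a density function $p(t,t'') = 1/t$;

3. Sample the next state $\mathbf{k}'$ using Algorithm 1;

4. Sample $N_1$ independent random values of $t'$ with a density function

$$p_1(t,t',t'') = \frac{\Gamma \exp(-\Gamma(t'-t''))}{1 - \exp(-\Gamma(t-t''))};$$

5. Calculate

$$\nu_\alpha = \frac{1}{N_1} \sum_{i=1}^{N_1} \frac{S_\alpha(\mathbf{k},\mathbf{k}',\mathbf{F},t'_i,t'')}{p_1(t,t'_i,t'')} \quad \text{and} \quad p_\alpha = \frac{|\nu_\alpha|}{|\nu_1| + |\nu_2|}, \quad \alpha = 1, 2;$$

6. Choose a value $\beta$, a uniformly distributed random variable in $(0, 1)$;
If ($\beta < p_1$) then

$$W := W\,\frac{K(\mathbf{k},\mathbf{k}')\,t\,\nu_1}{p_1\,p(\mathbf{k}'/\mathbf{k})}, \quad \xi := \xi + W\phi(\mathbf{k}'), \quad \mathbf{k} := \mathbf{k}';$$

else

$$W := W\,\frac{K(\mathbf{k},\mathbf{k}')\,t\,\nu_2}{p_2\,p(\mathbf{k}'/\mathbf{k})}, \quad \xi := \xi + W\phi(\mathbf{k});$$

7. Set $t := t''$ and repeat from step 2 until $t < \varepsilon_1$;

8. Repeat steps 1-7 $N$ times and estimate the electron energy distribution
function by Eq. (5).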
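
Step 4 of Algorithm 2 draws $t'$ from a truncated exponential density on $(t'', t)$. One way to do this without acceptance-rejection is the inverse function method: solving $\beta = (1 - e^{-\Gamma(t'-t'')})/(1 - e^{-\Gamma(t-t'')})$ for $t'$. The sketch below shows this under the assumption that the inverse function method is used for this step (the paper does not spell out the sampling formula); the function name and the parameter values are hypothetical.

```python
import math
import random


def sample_t_prime(t, t2, gamma, rng):
    """Draw t' in [t2, t) with density gamma * exp(-gamma * (t' - t2)) / (1 - exp(-gamma * (t - t2))).

    Uses inversion of the cumulative distribution function; t2 plays the role of t''.
    """
    beta = rng.random()
    return t2 - math.log(1.0 - beta * (1.0 - math.exp(-gamma * (t - t2)))) / gamma


# Hypothetical parameters: gamma = 1, t'' = 0, t = 2 (arbitrary time units).
rng = random.Random(5)
samples = [sample_t_prime(2.0, 0.0, 1.0, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

For these assumed parameters the exact mean of the density is $(1 - 3e^{-2})/(1 - e^{-2}) \approx 0.687$, which gives a quick check of the sampler.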



4 Parallel Implementation and Numerical Results 

The computational complexity of the obtained Algorithm 2 can be measured by
the quantity $F = N \times \tau \times E(m_{\varepsilon_1})$. The number of random walks, $N$, and the
average number of transitions in the Markov chain, $E(m_{\varepsilon_1})$, are connected with
the stochastic and systematic errors [7]. The mean time for modeling one transition,
$\tau$, depends on the complexity of the transition density functions and on the
sampling rule, as well as on the choice of the random number generator (rng).

It is well known that MC algorithms are very convenient for parallel
implementation on parallel computer systems [8], because every realization of
the MC estimator can be done independently and simultaneously. Although
MC algorithms are well suited to parallel computation, there are a number of
potential problems. The available computers can run at different speeds; they
can have different user loads on them; one or more of them can be down; the
rng's that they use can run at different speeds; etc. On the other hand, these
rng's must produce independent and non-overlapping random sequences. Thus,
the parallel realization of MC algorithms is not a trivial process on different
parallel computer systems.

In our research, Algorithm 2 has been implemented in C and has been
parallelized using an MPI code. The numerical tests have been performed on a
Sunfire 6800 SMP system with twenty-four 750 MHz UltraSPARC-III processors
located at the Edinburgh Parallel Computing Centre (EPCC). The SPRNG
library has been used to produce independent and non-overlapping random
sequences [9]. Our aim is to estimate the electron energy distribution function for
evolution times greater than 200 fs. That is why our parallel implementation
with $n$ processors includes the following strategy. A master-slave model is used
where the master processor delegates work ($N/n$ random walks) to the other
processors. The slave processors complete the required work and obtain "local"
MC estimates. After that they return the results back to the master processor,
which produces the "global" MC estimate. Such a parallel strategy is expected
to give linear speed-up and high parallel efficiency. The results in Table 1
confirm this assumption in general. In addition, an MPI/OpenMP mixed code has
been developed for the parallel implementation of Algorithm 2. Such mixed
mode code has been proposed in [10] as a more efficient parallelisation
strategy to perform diffusion Monte Carlo calculations on an SMP (Shared Memory
Programming) cluster.
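
The master-slave strategy can be illustrated with a small Python sketch. It is not the MPI/C code of the paper: worker threads from the standard library stand in for MPI processes, separately seeded `random.Random` instances stand in for the SPRNG streams (different seeds are an assumption here and do not by themselves guarantee non-overlapping sequences), and `walk` is a hypothetical placeholder for one realization of the MC estimator. Only the structure is the point: split $N$ walks into $N/n$ per worker, compute local estimates, and average them into the global estimate.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def local_estimate(walk, n_walks, seed):
    """Worker task: average n_walks realizations using its own random stream."""
    rng = random.Random(seed)
    return sum(walk(rng) for _ in range(n_walks)) / n_walks


def parallel_estimate(walk, n_total, n_workers, base_seed=0):
    """Master task: delegate n_total / n_workers walks to each worker and combine."""
    per_worker = n_total // n_workers
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(local_estimate, walk, per_worker, base_seed + rank)
                   for rank in range(n_workers)]
        local_results = [future.result() for future in futures]
    # All workers process the same number of walks, so a plain average suffices.
    return sum(local_results) / n_workers


# Placeholder random walk: a uniform number in (0, 1), whose mean is 0.5.
global_estimate = parallel_estimate(lambda rng: rng.random(), 9600, 4)
```

Because the workers exchange nothing until the final reduction, communication cost is small, which is the reason a near-linear speed-up is expected for this kind of decomposition.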



Table 1. The CPU time (seconds) for all 96 points, the speed-up, and the parallel 
efficiency for various combination of OpenMP threads and MPI processes. The number 
of random walks is N = 9600. The electric field is 0 kV/cm and the evolution time is 
100 fs. 



Processes    CPU                    Parallel        Processes    CPU                    Parallel
x Threads    Time       Speed-up    Efficiency      x Threads    Time       Speed-up    Efficiency
1x1          567.177                                1x1          567.177
1x2          565.955    1.00216     0.5011          2x1          287.505    1.97276     0.9864
1x4          524.357    1.08167     0.2704          4x1          144.540    3.92401     0.9810
1x6          460.623    1.23133     0.2052          6x1           96.321    5.88840     0.9814
1x8          426.134    1.33098     0.1664          8x1           74.465    7.61669     0.9521
2x2          310.443    1.82699     0.4567          2x2          310.443    1.82699     0.4567
2x3          292.286    1.94049     0.3234          3x2          211.674    2.67948     0.4466
2x4          274.871    2.06343     0.2579          4x2          160.933    3.54094     0.4426

The results in Table 1 demonstrate that this style of programming is not 
always the most efficient mechanism on SMP systems and cannot be regarded 
as the ideal programming model for all codes. 
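
The speed-up and parallel efficiency columns of Table 1 follow from the CPU times in the usual way: speed-up is the one-processor time divided by the parallel time, and efficiency is the speed-up divided by the number of processes times threads. A short Python check, with a hypothetical function name and two rows of Table 1 as input:

```python
def speedup_and_efficiency(t_serial, t_parallel, processes, threads):
    """Return (speed-up, parallel efficiency) for one configuration."""
    speedup = t_serial / t_parallel
    return speedup, speedup / (processes * threads)


# Rows "2x1" (pure MPI) and "1x2" (pure OpenMP) of Table 1.
mpi_speedup, mpi_efficiency = speedup_and_efficiency(567.177, 287.505, 2, 1)
omp_speedup, omp_efficiency = speedup_and_efficiency(567.177, 565.955, 1, 2)
```

The two rows use the same number of processors, yet the pure MPI run reaches an efficiency near 0.99 while the pure OpenMP run stays near 0.50, which is the contrast the paragraph above draws.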

The numerical results discussed in Figures 1-4 are obtained for zero
temperature and GaAs material parameters: the electron effective mass is 0.063,
the optical phonon energy is 36 meV, the static and optical dielectric constants
are $\varepsilon_s = 10.92$ and $\varepsilon_\infty = 12.9$. The initial condition at $t = 0$ is given by a
function which is Gaussian in energy, ($\phi(k) = \exp(-(b_1k^2 - b_2)^2)$, $b_1 = 96$ and
$b_2 = 24$), scaled in a way to ensure that the peak value is equal to unity. A
value $Q = 66 \times 10^7\,\mathrm{m}^{-1}$ is chosen for the radius of the integration domain $G$. The
solution $f(0,0,k_z,t)$ is estimated at $2 \times 96$ points that are symmetrically located
on the $z$-axis, the direction of the applied field. The truncation parameter is
$\varepsilon_1 = 0.001$. The quantity presented on the $y$-axes in all figures is
$|\mathbf{k}| \cdot f(0,0,k_z,t)$, i.e., it
is proportional to the distribution function multiplied by the density of states.
It is given in arbitrary units. The quantity $k^2$, given on the $x$-axes in units of
$10^{14}/\mathrm{m}^2$, is proportional to the electron energy.

The results for the computational cost of Algorithm 2 are obtained and
compared with the algorithm from [6]. To obtain a smooth solution in the
case when $t = 200$ fs, Algorithm 2 needs approximately 9 minutes per
point on one processor, while the algorithm presented in [6] needs approximately
30 minutes. These results and the parallelisation of Algorithm 2 make it possible
to estimate the solution of (1) with high accuracy for 300 fs evolution time.

The results for the electron energy distribution are presented in Figures 1-4
for 200 fs and 300 fs evolution times. The relaxation leads to a time-dependent
broadening of the replicas. The solutions presented in Figures 1 and 2 are along
the electric field, and the replicas are shifted to the right by the increasing electric
field. Figures 3 and 4 show the solutions in the direction opposite to the field, and
the replicas are shifted to the left. Also, we see in Figures 2 and 4 that a second
peak appears on the left of the initial condition and the replicas begin to
shift in the same way with the increase of the electric field. The solution in
the classically forbidden region (see Figures 3 and 4), on the right of the initial
condition, demonstrates enhancement of the electron population with the growth
of the electric field. The numerical results show that the intra-collisional field effect
is well demonstrated for 300 fs evolution time of the electron-phonon relaxation.

Fig. 1. Solutions $|\mathbf{k}| f(0,0,k_z,t)$ versus $|\mathbf{k}|^2 \times 10^{14}\,\mathrm{m}^{-2}$, at positive direction on the
$z$-axis, and $t = 200$ fs. The electric field is 0, 6 kV/cm, and 12 kV/cm and the number
of random walks per point is 1 million.

Fig. 2. Solutions $|\mathbf{k}| f(0,0,k_z,t)$ versus $|\mathbf{k}|^2 \times 10^{14}\,\mathrm{m}^{-2}$, at positive direction on the
$z$-axis, and $t = 300$ fs. The electric field is 0, 6 kV/cm, and 12 kV/cm and the number
of random walks per point is 24 million.

Fig. 3. Solutions $|\mathbf{k}| f(0,0,k_z,t)$ versus $|\mathbf{k}|^2 \times 10^{14}\,\mathrm{m}^{-2}$, at negative direction on the
$z$-axis, and $t = 200$ fs. The electric field is 0, 6 kV/cm, and 12 kV/cm and the number
of random walks per point is 1 million.

Fig. 4. Solutions $|\mathbf{k}| f(0,0,k_z,t)$ versus $|\mathbf{k}|^2 \times 10^{14}\,\mathrm{m}^{-2}$, at negative direction on the
$z$-axis, and $t = 300$ fs. The electric field is 0, 6 kV/cm, and 12 kV/cm and the number
of random walks per point is 24 million.

Conclusions. A parallel MC algorithm for solving the B-F equation in the presence 
of an applied electric field is presented. A new transition density for the Markov 
chain and an algorithm describing the sampling rule are suggested. The MC algorithm 
has low complexity in comparison with the algorithm from [6]. An MPI/OpenMP 
mixed-mode code is developed and compared with the pure MPI version. 
The numerical results show that pure MPI is preferable for large-scale MC 
simulations of the quantum kinetic equation under consideration. 
Abstract. The “random walk on the boundary” Monte Carlo method 
has been successfully used for solving boundary-value problems. This 
method has significant advantages when compared to random walks on 
spheres, balls or a grid, when solving exterior problems, or when solv- 
ing a problem at an arbitrary number of points using a single random 
walk. In this paper we study the properties of the method when we use 
quasirandom sequences instead of pseudorandom numbers to construct 
the walks on the boundary. Theoretical estimates of the convergence rate 
are given and numerical experiments are presented in an attempt to con- 
firm the convergence results. The numerical results show that for “walk 
on the boundary” quasirandom sequences provide a slight improvement 
over ordinary Monte Carlo. 



1 Introduction 

The “random walk on the boundary” Monte Carlo method has been successfully 
used for solving various boundary-value problems of mathematical physics. This 
method is based on classical potential theory, which makes it possible to convert 
the original problem (boundary-value problem) into an equivalent problem 
(boundary integral equation). The Monte Carlo technique is then used to solve 
the integral equation numerically. This method was first published in [13], and 
then applied to solving various boundary-value problems. 

The major drawback of the conventional Monte Carlo approach is the statistical 
rate of convergence: the computational error behaves as $O(N^{-1/2})$, where N is 
the number of random walks. Computer simulation of randomness is usually based on 
the generation of a sequence of standard pseudorandom numbers that mimic the 
theoretical behavior of “real” random numbers. One possible way to improve the 
convergence is to change the type of random numbers used. Quasi-Monte Carlo 
methods (QMCMs) use quasirandom (also known as low-discrepancy) sequences 
instead of pseudorandom numbers. The quasi-Monte Carlo method for integration 
in s dimensions has a convergence rate of approximately $O((\log N)^s N^{-1})$. 
While numerical integration is the major application area of quasirandom 
sequences, there are many other problems where quasi-Monte Carlo methods also 
give better results than the Monte Carlo methods. There are many papers on 
using quasirandom numbers for differential equations. The curious reader should 
consult, for example, [10, 3, 12, 8, 9]. 

Here, we recall some basic concepts of QMCMs [1]. First, for a sequence of N 
points $\{x_n\}$ in the s-dimensional half-open unit cube $I^s$ define 

$$R_N(J) = \frac{1}{N}\,\#\{x_n \in J\} - m(J) ,$$

where J is a rectangular set and m(J) is its volume. Then define two discrepancies 

$$D_N = \sup_{J \in E} |R_N(J)| , \qquad D_N^* = \sup_{J \in E^*} |R_N(J)| ,$$

where E is the set of all rectangular subsets in $I^s$ and $E^*$ is the set of all 
rectangular subsets in $I^s$ with one vertex at the origin. 

The basis for analyzing QMC quadrature error is the Koksma-Hlawka inequality: 

Theorem (Koksma-Hlawka, [1]): For any sequence $\{x_n\}$ and any function f of 
bounded variation (in the Hardy-Krause sense), the integration error is bounded 
as follows 

$$\left| \int_{I^s} f(x)\,dx - \frac{1}{N}\sum_{n=1}^{N} f(x_n) \right| \le V(f)\, D_N^* . \tag{1}$$



The star discrepancy of a point set of N truly random numbers in one dimension 
is $O(N^{-1/2}(\log\log N)^{1/2})$, while the discrepancy of N quasirandom numbers 
in s dimensions can be as low as $O(N^{-1}(\log N)^{s-1})$ (see, e.g. [1,11]). Most 
notable are the constructions of Hammersley, Halton, Sobol, Faure, and 
Niederreiter for producing quasirandom numbers. Descriptions of these can be 
found, for example, in Niederreiter’s monograph [11]. 
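
The gap between the two discrepancy rates can be seen in a small one-dimensional experiment. The following is a minimal sketch, not taken from the paper: it assumes the base-2 van der Corput sequence (the one-dimensional building block of the Halton construction) as the quasirandom sequence and uses the closed-form expression for the star discrepancy of a sorted one-dimensional point set. The function names and the seed are illustrative choices.

```python
import random

def radical_inverse(n: int, base: int = 2) -> float:
    """Van der Corput radical inverse of the integer n in the given base."""
    value, scale = 0.0, 1.0 / base
    while n > 0:
        n, digit = divmod(n, base)
        value += digit * scale
        scale /= base
    return value

def star_discrepancy_1d(points) -> float:
    """Exact star discrepancy of a finite point set in [0, 1).

    Uses D*_N = 1/(2N) + max_i |x_(i) - (2i - 1)/(2N)| for the sorted
    points x_(1) <= ... <= x_(N); the loop index below is zero-based.
    """
    xs = sorted(points)
    n = len(xs)
    return 1.0 / (2 * n) + max(abs(x - (2 * i + 1) / (2 * n)) for i, x in enumerate(xs))

N = 1024
quasi = [radical_inverse(k) for k in range(1, N + 1)]
rng = random.Random(12345)          # fixed seed, chosen arbitrarily
pseudo = [rng.random() for _ in range(N)]

d_quasi = star_discrepancy_1d(quasi)
d_pseudo = star_discrepancy_1d(pseudo)
print(d_quasi, d_pseudo)            # the quasirandom value is much smaller
```

For higher dimensions the star discrepancy is expensive to compute exactly, so this check is only practical as a one-dimensional illustration.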

In this paper, we present quasirandom walks on the boundary for solving some 
boundary-value problems. The paper is organized as follows. The formulation 
of the problem is given in §2. The random walk on the boundary method is 
described in §3. In §4 the use of quasirandom sequences is discussed, and 
numerical results are presented in §5. 
2 Formulation of the Problem 

Consider the problem of computing the electrostatic potential, u, in the interior 
of a compact dielectric G surrounded by another dielectric medium. Let $\rho(x)$ be 
the density of the charge distribution. Then we have 

$$\Delta u(x) = -\frac{\rho(x)}{\epsilon} , \qquad x \in G \ (\text{or } x \in \mathbb{R}^3 \setminus G) , \tag{2}$$

and u(x) satisfies the continuity conditions on the boundary of the domain: 

$$u_i(y) = u_e(y) , \qquad \epsilon_i \frac{\partial u_i}{\partial n(y)} = \epsilon_e \frac{\partial u_e}{\partial n(y)} , \qquad y \in \partial G . \tag{3}$$

Here, $u_i$ and $u_e$ are the limit values of the solution from inside and outside, 
respectively, and $\epsilon_i$ and $\epsilon_e$ are the corresponding constant dielectric permittivities. 

With the assumption that $\partial G$ is smooth enough and there are only point 
charges inside G, $q_m$, m = 1, ..., M, it is possible to represent the solution in 
the form [5,4] 

$$u(x) = g(x) + \int_{\partial G} \frac{1}{2\pi}\,\frac{1}{|x - y|}\,\mu(y)\,d\sigma(y) \equiv g(x) + u_0(x) , \tag{4}$$

where $g(x) = \sum_{m=1}^{M} \frac{q_m}{4\pi\epsilon_i\,|x - x_m|}$, and $x_m$ are the positions of the point charges. 

Taking into account boundary conditions (3) and discontinuity properties 
of the single-layer potential’s normal derivative [5], we arrive at the integral 
equation for the unknown density, $\mu$: 

$$\mu = -\lambda_0 \mathcal{K}\mu + f , \tag{5}$$

which is valid almost everywhere on $\partial G$. Here, $\lambda_0 = \frac{\epsilon_e - \epsilon_i}{\epsilon_e + \epsilon_i}$ and the kernel 
of the integral operator $\mathcal{K}$ is $\frac{1}{2\pi}\,\frac{\cos\varphi_{yy'}}{|y - y'|^2}$, where $\varphi_{yy'}$ is the angle between the 
external normal vector n(y) and y - y'. The free term of this equation equals 
$f = \lambda_0 \frac{\partial g}{\partial n(y)}$, and it can be computed analytically. Since $\lambda_0 < 1$, the Neumann series 
for (5) converges (see, e.g. [5,4]), and it is possible to calculate the solution as 

$$u_0(x) = \sum_{i=0}^{\infty} \left(h_x, (-\lambda_0 \mathcal{K})^i f\right) , \tag{6}$$

where $h_x(y) = \frac{1}{2\pi}\,\frac{1}{|x - y|}$. Usually, however, $\epsilon_e \gg \epsilon_i$ and, hence, 
$\bigl||\lambda_1| - 1\bigr| = \frac{2\epsilon_i}{\epsilon_e - \epsilon_i} \ll 1$. Here, $\lambda_1 = -1/\lambda_0$ is the smallest characteristic value of the operator 
$-\lambda_0 \mathcal{K}$. This means that convergence in (6) is rather slow. 
The situation is even worse when we consider a grounded body, G. In this 
case the original problem reduces to an internal Dirichlet problem for the Laplace 
equation: 

$$\Delta u_0 = 0 , \qquad u_0|_{\partial G} = \psi . \tag{7}$$

The double-layer potential representation of its solution $u_0(x) = (\mu^*, h_x^*)$ leads 
to the following integral equation for the unknown density, $\mu^*$: 

$$\mu^* = -\mathcal{K}^*\mu^* + \psi . \tag{8}$$


Here, $h_x^*(y) = \frac{1}{2\pi}\,\frac{\cos\varphi_{yx}}{|y - x|^2}$, the operator $\mathcal{K}^*$ is the adjoint of $\mathcal{K}$, and has the same 
characteristic values (see [5,4]). This means that $\lambda_1^* = -1$ and the Neumann 
series for solving (8) does not converge. 

To speed up the convergence in (6), and to calculate the solution of (8), we 
apply the method of spectral parameter substitution (see, e.g., [7], and [14,15] 
for Monte Carlo algorithms based on this method). This means that we consider 
the parameterized equation $\mu_\lambda = \lambda(-\lambda_0 \mathcal{K})\mu_\lambda + f$ and analytically continue its 
solution given by the Neumann series for $|\lambda| < |\lambda_1|$. This goal can be achieved by 
substituting for $\lambda$ its analytical expression in terms of another complex parameter, 
$\eta$, and representing $\mu_\lambda$ as a series in powers of $\eta$. 

In this particular case, it is possible to use the substitution 
$\lambda = \chi(\eta) = \frac{2|\lambda_1|\eta}{1 - \eta}$, and hence 

$$u_0(x) = \sum_{i=0}^{n} l_i^{(n)} (-\lambda_0)^i \left(h_x, \mathcal{K}^i f\right) + O(q^{n+1}) , \tag{9}$$

where $q = \frac{1}{1 + 2|\lambda_1|} < \frac{1}{3}$, $l_i^{(n)} = \sum_{j=i}^{n} C_{j-1}^{i-1} (2|\lambda_1|)^i q^j$. The rate, q, of geometric 
convergence of the transformed series in powers of $\eta$ at the point $\eta_0 = \chi^{-1}(1)$ is 
determined by the ratio of $|\eta_0|$ and $L = \min_i |\chi^{-1}(\lambda_i)|$. Here, $\lambda_i$ are characteristic 
values of $\mathcal{K}$ (and $\mathcal{K}^*$), and L = 1 [5,7]. 

Given a desired computational accuracy, we can calculate the number of 
terms needed in (9). Thus, the problem reduces to computing a finite number of 
multidimensional integrals. 

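The same representation is valid for the solution of the Dirichlet problem (7): 

$$u_0(x) = \sum_{i=0}^{n} l_i^{(n)} (-1)^i \left(\mathcal{K}^{*i}\psi, h_x^*\right) + O(q^{n+1}) . \tag{10}$$

Here, q = 1/3, and $l_i^{(n)} = \sum_{j=i}^{n} C_{j-1}^{i-1}\, 2^i q^j$ (see [15]). 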
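
The effect of the substitution can be checked on a scalar model. The sketch below is an illustration under stated assumptions, not the authors' code: the operator is replaced by a single number kappa, so that the equation becomes mu = kappa * mu + 1, and the coefficients are computed from the formula for the Dirichlet case (|lambda_1| = 1, q = 1/3), with the zeroth coefficient taken equal to 1. The function names are hypothetical.

```python
from math import comb

def transformed_coefficients(n: int, lam1_abs: float = 1.0) -> list:
    """Coefficients l_i^(n), i = 0..n, of the transformed truncated series."""
    q = 1.0 / (1.0 + 2.0 * lam1_abs)   # q = eta_0, the preimage of lambda = 1
    coeffs = [1.0]                      # the zeroth term is left unchanged
    for i in range(1, n + 1):
        coeffs.append(sum(comb(j - 1, i - 1) * (2.0 * lam1_abs) ** i * q ** j
                          for j in range(i, n + 1)))
    return coeffs

def transformed_sum(n: int, kappa: float, lam1_abs: float = 1.0) -> float:
    """Truncated transformed series for the scalar model mu = kappa * mu + 1."""
    coeffs = transformed_coefficients(n, lam1_abs)
    return sum(l * kappa ** i for i, l in enumerate(coeffs))

# Scalar analogue of the Dirichlet case: mu = -mu + 1 has the solution 1/2,
# while the plain Neumann series 1 - 1 + 1 - ... does not converge.
approx = transformed_sum(4, kappa=-1.0)
print(approx)   # close to 0.5; the remaining bias is of order q**(n + 1)
```

With n = 4 the scalar model is off by roughly one percent, and increasing n reduces the bias geometrically with ratio 1/3.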
3 Random Walks on the Boundary 

To construct Monte Carlo estimates for $u_0(x)$, it is sufficient to calculate the 
integral functionals $I_i(x) = (h_x, \mathcal{K}^i f)$ and $I_i^*(x) = (\mathcal{K}^{*i}\psi, h_x^*)$ of iterations of 
the reciprocally adjoint integral operators. Here, the domain of integration is 
$[\partial G]^{i+1}$. 

Let G be a convex domain. In this case, the kernel, k(y, y'), of $\mathcal{K}$ corresponds 
to the distribution of the point y that is uniform in solid angle as viewed from the 
point y'. This means that the most convenient way to implement the random 
walk on the boundary algorithm here is to use direct estimates for $I_i(x)$ and 
adjoint estimates for $I_i^*(x)$ [15]. Therefore 

$$I_i(x) = \mathbf{E}\, Q_i\, h_x(y_i) , \qquad I_i^*(x) = \mathbf{E}\, Q_i^*\, \psi(y_i) , \tag{11}$$

where $Y = \{y_0, y_1, \ldots\}$ is the Markov chain of random points on the boundary 
$\partial G$, with the initial density $p_0$ and transition density 
$p(y_i \to y_{i+1}) = \frac{1}{2\pi}\,\frac{\cos\varphi_{y_{i+1} y_i}}{|y_{i+1} - y_i|^2}$, and the random weights are 
$Q_i = \frac{f(y_0)}{p_0(y_0)}$, $Q_i^* = \frac{h_x^*(y_0)}{p_0(y_0)}$, i = 1, 2, ..., n. Hence, biased estimators for $u_0$ are 

$$\theta_1^{(n)} = \frac{f(y_0)}{p_0(y_0)} \sum_{i=0}^{n} l_i^{(n)} (-\lambda_0)^i\, h_x(y_i) \qquad \text{and} \qquad \theta_2^{(n)} = \frac{h_x^*(y_0)}{p_0(y_0)} \sum_{i=0}^{n} l_i^{(n)} (-1)^i\, \psi(y_i)$$

(the latter for the Dirichlet problem). 

Construction of the Markov chain, Y, is based on its geometrical interpretation. 
Given a point, $y_i$, we simulate a random isotropic direction $\omega_i$ and find 
the next point $y_{i+1}$ as the intersection of this direction with the boundary 
surface $\partial G$. It is well known that different procedures can be used to choose 
$\omega_i = (\omega_{i,1}, \omega_{i,2}, \omega_{i,3})$. We consider the procedure based on the direct simulation 
of the longitudinal angle. Normally, an acceptance-rejection method would 
be used. But since we plan to use quasirandom numbers, this is inadvisable 
(see, e.g. [6]). So we use the following algorithm: $\omega_{i,1} = 1 - 2\alpha_{i,1}$, $\varphi_i = 2\pi\alpha_{i,2}$, 
$d = \sqrt{1 - \omega_{i,1}^2}$, $\omega_{i,2} = d\,\sin\varphi_i$, $\omega_{i,3} = d\,\cos\varphi_i$, where the $\alpha_{i,j}$ are standard 
uniform pseudorandom numbers in the unit interval. 
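
The direction-sampling step can be sketched as follows. This is a minimal illustration, assuming that the boundary is the unit sphere (as in the first test problem), so that the next point on the boundary has a closed form; the function names are hypothetical. Exactly two numbers from the unit interval are consumed per step, with no rejection, which is what makes the step usable with a quasirandom sequence of fixed dimension.

```python
import math

def isotropic_direction(a1: float, a2: float):
    """Map two numbers in [0, 1) to a unit vector uniformly distributed on the sphere."""
    w1 = 1.0 - 2.0 * a1                      # uniformly distributed cosine
    phi = 2.0 * math.pi * a2                 # longitudinal angle
    d = math.sqrt(max(0.0, 1.0 - w1 * w1))   # radius of the circle at height w1
    return (w1, d * math.sin(phi), d * math.cos(phi))

def next_point_on_unit_sphere(y, w):
    """Second intersection of the line through y (|y| = 1) along w with the unit sphere."""
    t = -2.0 * sum(yi * wi for yi, wi in zip(y, w))
    return tuple(yi + t * wi for yi, wi in zip(y, w))

y0 = (1.0, 0.0, 0.0)
w = isotropic_direction(0.3, 0.7)
y1 = next_point_on_unit_sphere(y0, w)
print(w, y1)
```

For a general convex boundary the closed-form intersection would be replaced by a ray-surface intersection routine specific to that surface.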



4 Quasirandom Walks on the Boundary 

In this section we discuss how to use quasirandom numbers for solving the 
boundary-value problems (2), (3) and (7). To construct Monte Carlo estimates, 
in §2, we reformulated the original problem into the problem of solving integral 
equations (5) and (8). So, in order to make use of these representations when 
constructing quasirandom estimates, we have to refer to a Koksma-Hlawka type 
inequality for integral equations, [2]: 

$$\left| \mathbf{E}\,\theta[Y] - \frac{1}{N}\sum_{j=1}^{N} \theta^*(Q_j) \right| \le V(\theta^*)\, D_N^*(Q) , \tag{12}$$




where Q is a sequence of quasirandom vectors in $[0, 1)^s$, $s = d \times T$, d is the 
number of QRNs in one step of a random walk, T is the maximal number of 
steps in a single random walk, and $\theta^*$ corresponds to an estimate $\theta[Y]$ based on 
the random walk $Y^*$ generated from Q by a one-to-one map. Space precludes 
more discussion of the work of Chelson, but the reader is referred to the original 
for clarification [2]. 

This inequality ensures convergence when 9* is of bounded variation in the 
Hardy-Krause sense, which is a very serious limitation. But even when this con- 
dition is satisfied, the predicted rate of convergence is very pessimistic due to the 
high (and, strictly speaking, possibly unbounded) dimension of the quasirandom 
sequence in the general Monte Carlo method for solving these integral equations, 
(e.g. (6)). To avoid this limitation, we consider variants of the method with each 
random walk having fixed length. Clearly, the smaller the dimension of Q, the 
better the rate estimate in (12). 

Guided by this reasoning, we used the representations (9) and (10), and the 
corresponding random estimators $\theta_1$ and $\theta_2$, to construct quasirandom solutions 
to the original boundary-value problems. 

It is essential to note that despite the improved rate of convergence of our 
quasirandom-based calculations, constants in the error estimates are hard to 
calculate. By contrast, the statistical nature of Monte Carlo solutions makes 
it possible to determine confidence intervals almost exactly. 



5 Numerical Tests 



We performed numerical tests with two problems. 

The first one is the Dirichlet problem for the Laplace equation inside a sphere 
with an exact analytic solution. 

To solve this problem with quasirandom sequences, we fix the length of the series 
to be n = 4, which provides a 1% bias, and use 2(n+1)-dimensional Sobol, Halton 
and Faure sequences. We compared the approximate value of the solution at 
different points computed using MCM and QMCM. The QMCM solution shows 
a slightly better rate of convergence for Sobol and Halton sequences. 
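
A complete fixed-length walk for a problem of this type can be sketched in a few lines. The code below is an illustration under several assumptions, not the authors' implementation: the domain is the unit ball, the boundary data are the hypothetical choice psi(y) = y_1 (so that the exact solution is u(x) = x_1), the first boundary point is obtained by an isotropic ray from the interior point x (which makes the initial weight equal to 2), and the quasirandom points come from a Halton sequence built on the first ten primes. Each walk consumes 2(n+1) = 10 numbers for n = 4.

```python
import math
import random
from math import comb

PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)

def radical_inverse(n, base):
    value, scale = 0.0, 1.0 / base
    while n > 0:
        n, digit = divmod(n, base)
        value += digit * scale
        scale /= base
    return value

def halton_point(index, dim):
    """The index-th point of the Halton sequence in the given dimension (dim <= 10)."""
    return [radical_inverse(index, PRIMES[j]) for j in range(dim)]

def direction(a1, a2):
    w1 = 1.0 - 2.0 * a1
    phi = 2.0 * math.pi * a2
    d = math.sqrt(max(0.0, 1.0 - w1 * w1))
    return (w1, d * math.sin(phi), d * math.cos(phi))

def coefficients(n):
    """Coefficients of the truncated transformed series for q = 1/3."""
    q = 1.0 / 3.0
    return [1.0] + [sum(comb(j - 1, i - 1) * 2.0 ** i * q ** j for j in range(i, n + 1))
                    for i in range(1, n + 1)]

def walk_estimate(x, psi, numbers, coeffs):
    """One realization of the truncated estimator from 2(n+1) numbers in [0, 1)."""
    w = direction(numbers[0], numbers[1])
    b = sum(xi * wi for xi, wi in zip(x, w))
    c = sum(xi * xi for xi in x) - 1.0
    t = -b + math.sqrt(b * b - c)            # ray from interior x to the sphere
    y = tuple(xi + t * wi for xi, wi in zip(x, w))
    total = coeffs[0] * psi(y)
    for i in range(1, len(coeffs)):
        w = direction(numbers[2 * i], numbers[2 * i + 1])
        t = -2.0 * sum(yi * wi for yi, wi in zip(y, w))
        y = tuple(yi + t * wi for yi, wi in zip(y, w))   # next boundary point
        total += coeffs[i] * (-1.0) ** i * psi(y)
    return 2.0 * total                       # initial weight is 2 for this choice of y_0

def solve(x, psi, n_walks, n=4, quasi=True, seed=1):
    coeffs = coefficients(n)
    dim = 2 * (n + 1)
    rng = random.Random(seed)
    acc = 0.0
    for s in range(1, n_walks + 1):
        numbers = halton_point(s, dim) if quasi else [rng.random() for _ in range(dim)]
        acc += walk_estimate(x, psi, numbers, coeffs)
    return acc / n_walks

x = (0.5, 0.0, 0.0)
psi = lambda y: y[0]
print(solve(x, psi, 2000, quasi=True), solve(x, psi, 2000, quasi=False))
```

Both estimates should lie near the exact value 0.5; how much the quasirandom run gains depends on the sequence and on the number of walks, in line with the modest improvement reported here.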



Table 1. Test problem 1. Exact and approximate solution at different points 

x            u(x)       URAND      SOBOL      FAURE      HALTON 
(0.9,0,0)    0.124339   0.125317   0.124259   0.123183   0.125915 
(0,0.9,0)    0.074873   0.079288   0.074843   0.081115   0.075192 
(0,0,0.9)    0.074873   0.077072   0.075623   0.085194   0.074769 

The second problem is (2), (3) with G being the unit cube with a single 
unit point charge inside, $\epsilon_i = 4.0$, $\epsilon_e = 78.5$. To calculate $u_0$, defined by (4), 
we use the estimator $\theta_1$ and compute point values for the same order of bias 
(number of terms, n = 4) and different numbers of samples. 

Both tables show that we achieve better accuracy when we replace pseudorandom 
numbers with quasirandom sequences. We performed numerical tests with 
Sobol, Halton and Faure sequences. 



Table 2. Test problem 2. Exact and approximate solution at the point 
(0.95, 0.95, 0.95) using different numbers of walks 

Ntr.     u(x)       URAND      SOBOL      FAURE      HALTON 
100      -0.02446   -0.04086   -0.02234   -0.02102   -0.02537 
1000     -0.02446   -0.02947   -0.02635   -0.02476   -0.02418 
10000    -0.02446   -0.02634   -0.02470   -0.02436   -0.02448 



6 Conclusions 

In this paper we presented a successful application of quasirandom sequences to 
the walks on the boundary method for solving boundary-value problems. The 
success is due to the following reason: instead of solving the original integral 
equation arising from the integral representation of the solution, we parameterized 
the equation, analytically continued the solution, and used a special substitution 
to accelerate the convergence significantly. In this way, the problem was reduced 
to computing a small number of multidimensional integrals. This is the key point 
for the successful use of quasirandom sequences: they are designed to compute 
multidimensional integrals with a better rate of convergence than is obtained with 
pseudorandom numbers. 

We tested our approach by solving two problems using pseudorandom num- 
bers and the Sobol, Halton and Faure sequences. The accuracy of the quasiran- 
dom walks on the boundary method is better and the advantage of this method 
is significant for the second test problem. 
Abstract. Backward Monte Carlo methods for solving the Boltzmann 
equation are investigated. A stable estimator is proposed since a pre- 
viously published estimator was found to be numerically unstable. The 
principle of detailed balance, which is obeyed by state transitions of a 
physical system and ensures existence of a stable equilibrium solution, 
is violated by the transition probability of the unstable method, and is 
satisfied by construction with the proposed backward transition proba- 
bility. 



1 Introduction 

For the numerical study of non-equilibrium charge carrier transport in 
semiconductors the Monte Carlo (MC) method has found widespread application. 
In particular the physically transparent forward MC method is commonly em- 
ployed, which evaluates functionals of the distribution function. The more ab- 
stract backward MC method, however, has found virtually no application in 
semi-classical transport calculations. This method follows the particle history 
back in time and allows the distribution function to be evaluated at given points 
with desired accuracy. The method is particularly appealing in cases where the 
solution is sought in sparsely populated regions of the phase space only. 

In the field of semi-classical transport the backward MC method was 
proposed at the end of the 1980s [4, 8]. One of the roots of this method is in quantum 
transport [1], a field where various applications have also been reported [7, 3]. 

2 Boltzmann Equation 

On a semi-classical level the transport of charge carriers in semiconductors is 
described by the Boltzmann equation (BE). For the time- and position-dependent 
transport problem the BE reads 

$$\left(\frac{\partial}{\partial t} + \mathbf{v}(\mathbf{k})\cdot\nabla_{\mathbf{r}} + \mathbf{F}(\mathbf{r},t)\cdot\nabla_{\mathbf{k}}\right) f(\mathbf{k},\mathbf{r},t) = Q[f](\mathbf{k},\mathbf{r},t) , \qquad \mathbf{r} \in D . \tag{1}$$

This equation is posed in the simulation domain D and has to be supplemented 
by boundary and initial conditions. The distribution function is commonly 
normalized as $\int_D d\mathbf{r} \int d\mathbf{k}\, f(\mathbf{k},\mathbf{r},t) = 4\pi^3 N_D(t)$, with $N_D$ denoting the number of 
carriers contained in the semiconductor domain of volume $V_D$. 
In (1) the carrier’s group velocity $\mathbf{v}$ is related to the band energy $\epsilon(\mathbf{k})$ by 
$\mathbf{v} = \hbar^{-1}\nabla_{\mathbf{k}}\epsilon(\mathbf{k})$. The force field $\mathbf{F}$ takes into account electric and magnetic fields. 
If only an electric field $\mathbf{E}$ is present, the force field is given by $\mathbf{F} = q\mathbf{E}/\hbar$, where 
q is the charge of the carrier. The scattering operator $Q = Q_g - Q_l$ consists of 
a gain and a loss term, respectively. If many-body effects such as carrier-carrier 
scattering and degeneracy are neglected, the scattering operator will be linear, 
an assumption that is crucial for the presented approach. The two components 
of Q are 

$$Q_g[f](\mathbf{k},\mathbf{r},t) = \int f(\mathbf{k}',\mathbf{r},t)\, S(\mathbf{k}',\mathbf{k},\mathbf{r},t)\, d\mathbf{k}' , \tag{2}$$

$$Q_l[f](\mathbf{k},\mathbf{r},t) = \lambda(\mathbf{k},\mathbf{r},t)\, f(\mathbf{k},\mathbf{r},t) , \tag{3}$$

with $\lambda(\mathbf{k},\mathbf{r},t) = \int S(\mathbf{k},\mathbf{k}',\mathbf{r},t)\, d\mathbf{k}'$ denoting the total scattering rate. 

2.1 Integral Form of the Boltzmann Equation 

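The BE is now transformed to integral form by a formal integration over a phase 
space trajectory. 

$$\mathbf{K}(\tau) = \mathbf{k}_0 + \int_{t_0}^{\tau} \mathbf{F}(\mathbf{R}(y), y)\, dy , \qquad \mathbf{R}(\tau) = \mathbf{r}_0 + \int_{t_0}^{\tau} \mathbf{v}(\mathbf{K}(y))\, dy \tag{4}$$

This trajectory has the initial condition $\mathbf{K}(t_0) = \mathbf{k}_0$ and $\mathbf{R}(t_0) = \mathbf{r}_0$ and solves 
the equations of motion in phase space, given by Newton’s law and the carrier’s 
group velocity. The following integral form of the BE is obtained [5]: 

$$f(\mathbf{k},\mathbf{r},t) = \int_0^t dt' \int d\mathbf{k}'\, f(\mathbf{k}',\mathbf{R}(t'),t')\, S(\mathbf{k}',\mathbf{K}(t'),\mathbf{R}(t'),t') \exp\left(-\int_{t'}^{t} \lambda(\mathbf{K}(y),\mathbf{R}(y),y)\, dy\right)$$
$$\qquad + f_i(\mathbf{K}(0),\mathbf{R}(0)) \exp\left(-\int_0^t \lambda(\mathbf{K}(y),\mathbf{R}(y),y)\, dy\right) \tag{5}$$

This equation represents the generalized form of Chamber’s path integral [2]. 
The source term contains $f_i$, a given initial distribution. Augmenting the kernel 
by a delta-function of the real space coordinate and a unit step function of the 
time coordinate allows transformation of (5) into an integral equation of the 
second kind: 

$$K(\mathbf{k},\mathbf{r},t,\mathbf{k}',\mathbf{r}',t') = S(\mathbf{k}',\mathbf{K}(t'),\mathbf{r}',t') \exp\left(-\int_{t'}^{t} \lambda(\mathbf{K}(y),\mathbf{R}(y),y)\, dy\right) \delta(\mathbf{r}' - \mathbf{R}(t'))\, H(t - t') \tag{6}$$

$$f(\mathbf{k},\mathbf{r},t) = \int_0^{\infty} dt' \int d\mathbf{k}' \int d\mathbf{r}'\, K(\mathbf{k},\mathbf{r},t,\mathbf{k}',\mathbf{r}',t')\, f(\mathbf{k}',\mathbf{r}',t') + f_0(\mathbf{k},\mathbf{r},t) \tag{7}$$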

2.2 The Neumann Series 

Substituting (5) recursively into itself gives the Neumann series expansion of the 
solution f in a given phase space point $\mathbf{k}$, $\mathbf{r}$ at time t. 

$$f(\mathbf{k},\mathbf{r},t) = f^{(0)} + f^{(1)} + f^{(2)} + \ldots \tag{8}$$

Convergence of the series has been proven in [6]. MC algorithms for solving an 
integral equation can be derived by evaluating the terms of the iteration series 
by MC integration. Each term of the iteration series has the same structure. In 
the term of order n the integral operator is applied n times to the source term 
$f_0$. As an example we write the term of second order explicitly. 

$$f^{(2)}(\mathbf{k},\mathbf{r},t) = \int_0^t dt_1 \int d\mathbf{k}_1 \int_0^{t_1} dt_2 \int d\mathbf{k}_2\; f_i(\mathbf{K}_2(0),\mathbf{R}_2(0)) \exp\left(-\int_0^{t_2} \lambda(\mathbf{K}_2(y),\mathbf{R}_2(y),y)\, dy\right)$$
$$\qquad \times\, S(\mathbf{k}_2,\mathbf{K}_1(t_2),\mathbf{R}_1(t_2),t_2) \exp\left(-\int_{t_2}^{t_1} \lambda(\mathbf{K}_1(y),\mathbf{R}_1(y),y)\, dy\right)$$
$$\qquad \times\, S(\mathbf{k}_1,\mathbf{K}_0(t_1),\mathbf{R}_0(t_1),t_1) \exp\left(-\int_{t_1}^{t} \lambda(\mathbf{K}_0(y),\mathbf{R}_0(y),y)\, dy\right) \tag{9}$$

Final conditions for the k-space trajectories are given first by $\mathbf{K}_0(t) = \mathbf{k}$ and 
then by the before-scattering states $\mathbf{K}_1(t_1) = \mathbf{k}_1$ and $\mathbf{K}_2(t_2) = \mathbf{k}_2$ (see Fig. 1). 
The real space trajectory ends at final time t in the given point $\mathbf{R}_0(t) = \mathbf{r}$ and 
is continuous at the time of scattering: $\mathbf{R}_1(t_1) = \mathbf{R}_0(t_1)$, $\mathbf{R}_2(t_2) = \mathbf{R}_1(t_2)$. 

The iteration term (9) describes the contribution of all second order trajectories 
to the solution. On such a trajectory a particle undergoes two scattering 
events while propagating from time 0 to t and is found on its third free-flight 
path at time t. 



3 Backward Monte Carlo Methods 

Backward MC algorithms for the solution of the Boltzmann equation have been 
proposed in [4, 8]. Given an integral equation $f(x) = \int K(x, x') f(x')\, dx' + f_0(x)$, 
the backward estimator of the n-th iteration term is constructed as 

$$\nu^{(n)}(x_0) = \frac{K(x_0, x_1)}{p(x_0, x_1)} \cdots \frac{K(x_{n-1}, x_n)}{p(x_{n-1}, x_n)}\, f_0(x_n) , \tag{10}$$





Fig. 1. Sketch of a backward trajectory starting at time t and reaching time 0 after 
three free flights. The symbols used in (9) are shown. 



where p denotes a transition probability density. The set of points $x_0, x_1, \ldots, x_n$ is 
referred to as a numerical trajectory. After generation of N numerical trajectories 
the n-th iteration term is estimated by the sample mean 

$$f^{(n)}(x_0) \simeq \frac{1}{N} \sum_{s=1}^{N} \nu_s^{(n)}(x_0) \tag{11}$$

For the Boltzmann equation considered here the variable x denotes $x = (\mathbf{k},\mathbf{r},t)$. 
In this work we discuss two specific choices of the transition probability p. 
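
The structure of such a backward estimator can be illustrated on a toy integral equation. The sketch below is not the Boltzmann problem: it assumes the hypothetical kernel K(x, x') = x x' on [0, 1] with source term f_0 = 1, for which the exact solution is f(x) = 1 + 0.75 x, and a uniform transition density. For brevity it accumulates the contributions of all orders along each numerical trajectory, which is a common variant of averaging each iteration term separately.

```python
import random

def backward_estimate(x0, n_traj=20000, max_order=12, seed=7):
    """Estimate f(x0) for f(x) = int_0^1 x * x' * f(x') dx' + 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_traj):
        x, weight, value = x0, 1.0, 1.0     # order-0 contribution f_0(x0) = 1
        for _ in range(max_order):
            x_next = rng.random()           # transition density p(x, x') = 1 on [0, 1)
            weight *= x * x_next            # multiply by K(x, x') / p(x, x')
            value += weight                 # add weight * f_0(x_next), with f_0 = 1
            x = x_next
        total += value
    return total / n_traj

print(backward_estimate(1.0))   # exact value is 1.75
```

The trajectory starts at the point where the solution is wanted, which is what makes the backward approach attractive when only a few phase space points are of interest.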

3.1 Probability Density Functions 

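The components of the kernel (6) are used to construct probability density 
functions (pdf). From the scattering rate S one can define a pdf of the after-scattering 
states $\mathbf{k}_a$, 

$$p_k(\mathbf{k}_b, \mathbf{k}_a) = \frac{S(\mathbf{k}_b, \mathbf{k}_a)}{\lambda(\mathbf{k}_b)} , \tag{12}$$

with the total scattering rate $\lambda(\mathbf{k}_b) = \int S(\mathbf{k}_b, \mathbf{k}_a)\, d\mathbf{k}_a$ as a normalization factor. 
Conversely, the pdf of the before-scattering states $\mathbf{k}_b$ is defined as 

$$p_k^*(\mathbf{k}_b, \mathbf{k}_a) = \frac{S(\mathbf{k}_b, \mathbf{k}_a)}{\lambda^*(\mathbf{k}_a)} . \tag{13}$$

The normalization factor is given by the backward scattering rate, 
$\lambda^*(\mathbf{k}_a) = \int S(\mathbf{k}_b, \mathbf{k}_a)\, d\mathbf{k}_b$. 

The pdf of the backward free-flight time $t_i$ is given by 

$$p_t(t_i, t_j, \mathbf{k}_j) = \lambda(\mathbf{K}_j(t_i)) \exp\left(-\int_{t_i}^{t_j} \lambda(\mathbf{K}_j(y))\, dy\right) , \tag{14}$$

and satisfies the normalization $\int_{-\infty}^{t_j} p_t(t_i, t_j, \mathbf{k}_j)\, dt_i = 1$. The final condition of the 
trajectory is $\mathbf{K}_j(t_j) = \mathbf{k}_j$. 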



3.2 The Source Term 



In the case of the Boltzmann equation the source term is treated in a specific way. 

$$f_0(\mathbf{k},\mathbf{r},t) = f_i(\mathbf{K}(0),\mathbf{R}(0)) \exp\left(-\int_0^t \lambda(\mathbf{K}(y))\, dy\right) \tag{15}$$

The exponential function represents the probability that a particle moves without 
scattering from 0 to t. This probability is now used as an acceptance probability. 
The probability is expressed as an integral over the respective pdf, given 
by (14). 

$$\exp\left(-\int_0^t \lambda(\mathbf{K}(y))\, dy\right) = \int_{-\infty}^{0} p_t(\tau, t, \mathbf{k})\, d\tau \tag{16}$$

The acceptance probability is checked as follows. For a particle in state $\mathbf{k}$ at 
time t the backward free-flight time t' is generated from $p_t$. If t' is negative, 
the estimator is nonzero, otherwise, the estimator evaluates to zero. To obtain 
a nonzero estimator of the n-th iteration term all the generated times $t_1, \ldots, t_n$ 
must be positive, whereas the next time generated, $t_{n+1}$, must be negative. For 
a trajectory of order n all other estimators of order $m \ne n$ evaluate to zero. In 
this way an estimator for the distribution function f is obtained as 

$$f(x_0) = \sum_{n=0}^{\infty} f^{(n)}(x_0) \simeq \frac{1}{N} \sum_{s=1}^{N} \nu_s^{(n(s))}(x_0) \tag{17}$$

Here n(s) denotes the order of the s-th numerical trajectory. The estimator is 
now defined as 

$$\nu^{(n)}(x_0) = \frac{K(x_0, x_1)}{p(x_0, x_1)} \cdots \frac{K(x_{n-1}, x_n)}{p(x_{n-1}, x_n)}\, f_i(\mathbf{K}_n(0),\mathbf{R}_n(0)) . \tag{18}$$

Note that this estimator samples the initial distribution $f_i$, whereas in (10) the 
source term $f_0$ is sampled. 



3.3 Transition Probability Densities 

In the original work [8] $S(\mathbf{k}_b, \mathbf{k}_a)$ is interpreted as the unnormalized distribution 
of the before-scattering states $\mathbf{k}_b$, and consequently the normalized pdf (13) is 
employed. Using the transition density 

$$P(\mathbf{k}',\mathbf{r}',t';\mathbf{k},\mathbf{r},t) = p_k^*(\mathbf{k}',\mathbf{K}(t'))\; p_t(t', t, \mathbf{k})\; \delta(\mathbf{r}' - \mathbf{R}(t'))\; H(t - t') \tag{19}$$

the estimator (18) becomes 

$$\nu^{(n)} = \frac{\lambda^*(\mathbf{K}_0(t_1))}{\lambda(\mathbf{K}_0(t_1))} \cdots \frac{\lambda^*(\mathbf{K}_{n-1}(t_n))}{\lambda(\mathbf{K}_{n-1}(t_n))}\, f_i(\mathbf{K}_n(0),\mathbf{R}_n(0)) . \tag{20}$$


Although the MC algorithm based on the estimator (20) is consistently derived 
from the integral form of the BE, computer experiments reveal a stability prob- 
lem. The particle energy becomes very large when the trajectory is followed 
backward in time. The initial distribution takes on very small values at high 
energies, such that many realizations of the estimator will be very small. With 
small probability, the particle energy will stay low, where the initial distribution 
is large. These rare events give large contributions to the estimator, resulting 
in a large variance. The computer experiments show that the variance increases 
rapidly with time. However, for a given time t the variance of the estimator is 
finite. 

The time evolution of the particle energy can be understood from a property 
of the scattering rate known as the principle of detailed balance. This property 
ensures that in any system particles scatter preferably to lower energies. If for 
trajectory construction the backward transition rate (13) is employed, the prin- 
ciple of detailed balance is inverted in the simulation and scattering to higher 
energies is preferred. 

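The principle of detailed balance is reflected by the following symmetry 
property of the scattering rate: 

$$S(\mathbf{k}_i, \mathbf{k}_j) = S(\mathbf{k}_j, \mathbf{k}_i) \exp\bigl(\beta(\epsilon(\mathbf{k}_i) - \epsilon(\mathbf{k}_j))\bigr) , \tag{21}$$

where $\beta = 1/(k_B T)$ and $\epsilon(\mathbf{k})$ denotes the carrier energy. The scattering rate 
of carriers in a semiconductor contains contributions from various scattering 
sources and is thus represented by the sum of the corresponding rates. 

$$S(\mathbf{k}_i, \mathbf{k}_j) = \sum_l S_l(\mathbf{k}_i, \mathbf{k}_j) \tag{22}$$

The total scattering rate is given by $\lambda(\mathbf{k}) = \sum_l \lambda_l(\mathbf{k})$. A scattering mechanism is 
either elastic, that is $\epsilon(\mathbf{k}_i) = \epsilon(\mathbf{k}_j)$, or inelastic, $\epsilon(\mathbf{k}_i) \ne \epsilon(\mathbf{k}_j)$. For each inelastic 
process the sum contains two entries, where one entry describes the inverse 
process of the other. In the case of phonon scattering these partial processes 
are caused by absorption and emission of a phonon, respectively. The scattering 
rates are commonly derived from Fermi’s golden rule, 

$$S_{ab}(\mathbf{k}_i, \mathbf{k}_j) = \frac{2\pi}{\hbar}\, |M|^2 N_q\, \delta\bigl(\epsilon(\mathbf{k}_i) + \hbar\omega - \epsilon(\mathbf{k}_j)\bigr) , \tag{23}$$

$$S_{em}(\mathbf{k}_i, \mathbf{k}_j) = \frac{2\pi}{\hbar}\, |M|^2 (N_q + 1)\, \delta\bigl(\epsilon(\mathbf{k}_i) - \hbar\omega - \epsilon(\mathbf{k}_j)\bigr) , \tag{24}$$

where M denotes the interaction matrix element, $N_q$ the Bose-Einstein statistics 
and $\hbar\omega$ the phonon energy. Interchanging $\mathbf{k}_i$ and $\mathbf{k}_j$ and taking into account the 
relation $(N_q + 1) = N_q \exp(\beta\hbar\omega)$ gives the following symmetry property 

$$S_{ab}(\mathbf{k}_j, \mathbf{k}_i) = S_{em}(\mathbf{k}_i, \mathbf{k}_j) \exp(-\beta\hbar\omega) \tag{25}$$

$$S_{em}(\mathbf{k}_j, \mathbf{k}_i) = S_{ab}(\mathbf{k}_i, \mathbf{k}_j) \exp(\beta\hbar\omega) \tag{26}$$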
(26) 
Fig. 2. Electron energy distribution functions obtained by backward and forward MC 
algorithms. 



This formulation shows that the absorption rate in backward direction is pro- 
portional to the emission rate in forward direction, and vice versa. From (25) 
and (26) it follows then that S = S'ab + S'em has the symmetry property (21). 
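
The relation between the absorption and emission prefactors can be checked numerically. The following is a small sketch with assumed values: a phonon energy of 60 meV and a lattice temperature of 300 K are hypothetical inputs chosen only for illustration, and the function names are not from the paper. Since (23) and (24) share the same matrix element and the same delta function after interchanging the states, the ratio of the two rates reduces to N_q / (N_q + 1).

```python
import math

K_B = 8.617333262e-5   # Boltzmann constant in eV/K

def bose_einstein(phonon_energy_ev, temperature_k):
    """Equilibrium phonon occupation number N_q."""
    return 1.0 / math.expm1(phonon_energy_ev / (K_B * temperature_k))

def absorption_to_emission_ratio(phonon_energy_ev, temperature_k):
    """Ratio N_q / (N_q + 1) of the prefactors in the absorption and emission rates."""
    n_q = bose_einstein(phonon_energy_ev, temperature_k)
    return n_q / (n_q + 1.0)

hw, T = 0.06, 300.0            # hypothetical phonon energy (eV) and temperature (K)
beta = 1.0 / (K_B * T)
ratio = absorption_to_emission_ratio(hw, T)
print(ratio, math.exp(-beta * hw))   # the two values agree
```

The same exponential factor is what appears, one scattering event at a time, in the weights of the stable estimator.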

The stability problem can be solved by using the forward scattering rate also 
for the construction of the backward trajectory and changing the estimator 
accordingly. In the transition density the forward pdf (12) is employed. 

$$P(\mathbf{k}',\mathbf{r}',t';\mathbf{k},\mathbf{r},t) = p_k(\mathbf{k}',\mathbf{K}(t'))\; p_t(t', t, \mathbf{k})\; \delta(\mathbf{r}' - \mathbf{R}(t')) \tag{27}$$

The estimator (18) becomes 

$$\nu^{(n)}(\mathbf{k},\mathbf{r},t) = \exp(\beta\Delta\epsilon_1) \cdots \exp(\beta\Delta\epsilon_n)\, f_i(\mathbf{K}_n(0),\mathbf{R}_n(0)) . \tag{28}$$

The $\Delta\epsilon_l$ denote the difference in particle energy introduced by the l-th scattering 
event. 

4 Results 

MC calculations of electron transport in silicon have been performed. The conditions 
assumed are E = 10 kV/cm and t = 3 ps at T = 300 K. Fig. 2 compares 
the electron energy distributions as computed by the backward MC method 
and a forward MC method employing statistical enhancement through event 
biasing. As initial distribution a Maxwellian distribution is assumed. The back- 
ward method is used to evaluate the energy distribution at discrete points above 
800 meV. The statistical uncertainty of the result is controlled by the number 
of numerical trajectories starting from each point. In the simulation 10^ back- 
ward trajectories are computed for each point. The backward method resolves 
the high energy tail with high precision as shown in Fig. 2. The depicted range 
of 30 decades is out of reach for the variant of the forward MC method 
considered here. 
Abstract. A stochastic method for simulation of carrier transport in 
semiconductor nanostructures is presented. The Wigner formulation of 
quantum mechanics is used. The method, which aims at evaluation of 
mean values of physical quantities, is obtained by following the rules of 
the Monte Carlo theory. A particular numerical feature of the method 
is the use of statistical weights with inverse signs, which can attain large 
absolute values. They give rise to a high variance of the simulation results, 
the so-called sign problem in quantum Monte Carlo simulations. A weight 
decomposition approach is proposed which limits the value of a weight 
by storing part of it on a phase space grid. Annihilation of positive and 
negative stored weights, which occurs during the simulation, significantly 
reduces the variance of the method. 

1 Introduction 

The Wigner-function ($f_w$) formalism has been recognized as a convenient 
approach to describe electron transport in mesoscopic systems. The formalism 
combines a rigorous quantum-kinetic level of treatment with the classical concepts 
of phase space and open boundary conditions. Moreover, dissipation processes 
introduced by phonons can be taken into account by adding the classical 
Boltzmann collision operator to the Wigner operator $V_w$ [3]. The basic similarity with 
the classical transport picture motivates a Monte Carlo (MC) method for solving 
the Wigner equation. For the sake of transparency the method is described 
for the case of stationary, coherent one-dimensional transport. A generalization 
for phonon scattering follows the same approach. The corresponding Wigner 
equation reads: 



where V is the device potential, x is the position, and $\hbar k$ is the momentum. 

The proposed method aims at evaluation of the mean value $\langle A \rangle$ of a given 
physical quantity A(x, k). The basic expression is obtained by a series expansion 
of $\langle A \rangle$ derived in the next section. 
2 The Series Expansion of $\langle A \rangle$ 

As a first step an integral form of the Wigner equation is obtained. A 
non-negative function $\gamma(x)$ is introduced, which will be determined later, and 
$\gamma(x) f_w(x, k)$ is added to both sides of (1). The characteristics of the differential 
operator are classical Newton’s trajectories 

$$x(t) = x + v(k)\,t , \qquad k(t) = k , \qquad v(k) = \frac{\hbar k}{m} .$$

A trajectory (x(t), k(t)) is initialized by the phase space point (x, k) at time 0; 
v(k) is the electron velocity and the parameterization is backward in time, t < 0. 

Newton’s trajectories are used to transform (1) into the following integral 
equation: 

$$f_w(x, k) = \int_{t_b}^{0} dt' \int dk'\; \Gamma(x(t'), k(t'), k')\; e^{-\int_{t'}^{0} \gamma(x(y))\, dy}\; f_w(x(t'), k') + e^{-\int_{t_b}^{0} \gamma(x(y))\, dy}\; f_b(x(t_b), k(t_b)) \tag{2}$$

$$\Gamma(x, k, k') = \gamma(x)\,\delta(k' - k) + V_w(x, k' - k)$$

Here $t_b$ is the time of the trajectory crossing point $x(t_b)$ with the device boundary, 
where the Wigner function values $f_b$ are known. 

The mean value $\langle A \rangle$ is defined by the inner product of A and $f_w$: 

$$\langle A \rangle = \int dx \int dk\; f_w(x, k)\, A(x, k)$$

In this formulation $\langle A \rangle$ is obtained from the solution $f_w$, which can be evaluated 
by a backward MC method. 

An alternative formulation leads to a forward MC method which directly 
evaluates the mean value (A) . Considered is the conjugate equation of (2) with 
a free term A and solution g. The following relation can be established: 

/dx /"d/c /(x, fc)A(x, fc) = [dx [dk fb{x{tb),k{tb))e fc) (3) 



Equation (3) shows that the mean value of A is determined by the inner product 
of the free term of (2) with the solution g of the conjugate Wigner equation. 
The latter is obtained from (2) by applying the same steps used to derive the 
conjugate Boltzmann equation [2]. 

$$g(x,k) = \int_{0}^{\infty} dt \int dk'\, \Gamma(x,k',k)\, e^{-\int_{0}^{t}\nu(x(y))\,dy}\, \theta_D(x)\, g(x(t),k'(t)) + A(x,k) \qquad (4)$$

Here the indicator θ_D(x) of the device domain D takes the value 1 if x ∈ D and 
0 otherwise. The trajectories are in a forward parameterization, t > 0. The last 
term in (3) is also expressed in a forward parameterization [2]. Then the space 
integral gives rise to a time integral and a sum over the left and right boundaries 
x_l, x_r of the one-dimensional device: 



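$$\langle A\rangle = \sum_{x_b = x_l, x_r} \int_{0}^{\infty} dt_0 \int dk_i\, |v(k_i)|\, f_b(x_b,k_i)\, e^{-\int_{0}^{t_0}\nu(x_b(y))\,dy}\, g(x_b(t_0),k_i(t_0)) \qquad (5)$$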



The index i denotes the subspace of wave vectors pointing inside the device. 

The iterative expansion of g, obtained from (4), is replaced in (5) which leads 
to the series expansion: 



i 



( 6 ) 



In this way the mean value of any physical quantity is determined by the velocity 
weighted boundary conditions and the consecutive iterations of the conjugate 
kernel K on the free term A. The expression for ⟨A⟩_1 is shown as an instructive 
example: 



<A>i= E 

Xb—Xi,Xr 




\v±{h)\fb{xb,ki)9D{xb{to)) X 



( 7 ) 






Due to θ_D, only that part of a Newton’s trajectory which belongs to D contributes 
to ⟨A⟩_1. 

The series (6) is the main entity of the stochastic approach. Following the 
MC theory, numerical trajectories are constructed which evaluate the consecutive 
terms of the series. The numerical trajectories are constructed by an initial P 
and a transition PP probability density chosen for this purpose. P is used to 
choose the initial points of the numerical trajectories on the device boundary. 
The initial density P is adopted from the classical single-particle MC method 
[2], since |v| f_b in (5) resembles the boundary term of the Boltzmann transport 
case. 

The trajectories are built up by consecutive applications of the transition 
density PP. At each iteration i of the kernel, a quantity called the statistical 
weight w is multiplied by the weight factor w_i = K/PP. The random variable 
whose realizations sample ⟨A⟩_i is evaluated from the product w A(x_i, k_i). The 
random variable corresponding to ⟨A⟩ is given by the sum of these products over 
all iterations. 
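
The weight update described above can be illustrated on a finite-dimensional analogue. The following minimal sketch is not the Wigner kernel itself: it assumes a small matrix `K` with hypothetical entries, draws each transition from a density proportional to |K|, and multiplies the statistical weight by the factor K/P at every step, so that the accumulated sum of `w * A` estimates the consecutive kernel iterations on the free term. The function names and the truncation at `max_steps` are choices made for this example only.

```python
import random


def neumann_series_mc(K, A, start, n_walks=20000, max_steps=20, seed=1):
    """Estimate sum_i (K^i A)[start] with weighted random walks."""
    rng = random.Random(seed)
    n = len(A)
    total = 0.0
    for _ in range(n_walks):
        state, w = start, 1.0
        estimate = A[state]                      # zeroth term of the series
        for _ in range(max_steps):
            row = K[state]
            norm = sum(abs(x) for x in row)
            if norm == 0.0:
                break
            # transition density P(j | state) = |K[state][j]| / norm
            j = rng.choices(range(n), weights=[abs(x) for x in row])[0]
            # weight factor K / P = sign(K[state][j]) * norm
            w *= norm if row[j] > 0 else -norm
            state = j
            estimate += w * A[state]
        total += estimate
    return total / n_walks


def neumann_series_reference(K, A, start, n_terms=200):
    """Deterministic reference obtained by iterating g <- A + K g."""
    n = len(A)
    g = list(A)
    for _ in range(n_terms):
        g = [A[r] + sum(K[r][c] * g[c] for c in range(n)) for r in range(n)]
    return g[start]
```

Because the example kernel contains negative entries, individual walks carry weights of both signs, which is the same mechanism that produces cancelling contributions in the Wigner case.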

The choice of the transition density plays an important role for the numerical 
properties of the MC algorithm. The kernel can be expressed as a product 
of a weight factor w_i and conditional probability densities which comprise the 
transition density PP: 

K = PP_1 PP_2 … w_i. 

In the following two possible formulations are discussed. In the first case the 
numerical trajectories are associated with particles which evolve on parts of 
Newton’s trajectories, linked by scattering processes. The kernel is interpreted 
as a scattering operator. In the second case a weight decomposition is proposed. 
This is achieved by splitting the evolving trajectory in phase space. The kernel 
is interpreted as an operator which generates particles during the process of 
evolution. 



3 Scattering Interpretation of the Kernel 



The kernel of (4) is augmented by extra factors, introduced in a way to conserve 
the value of the integrand. They ensure the normalization of the conditional 
probability densities which are enclosed in curly brackets. 



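$$K = \left\{\nu(t)\,e^{-\int_0^t \nu(y)\,dy}\right\}_1 \theta_D(t) \left[ \left\{\frac{\nu(t)}{\nu(t)+\gamma(t)}\right\}_2 \left\{\delta(k(t)-k')\right\}_3 + \left\{\frac{\gamma(t)}{\nu(t)+\gamma(t)}\right\}_{2'} \left\{\frac{|V_w(x(t),k(t)-k')|}{\gamma(t)}\right\}_{3'} \right] \left(\pm\frac{\nu(t)+\gamma(t)}{\nu(t)}\right)$$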







Here γ(x(t)) = ∫ |V_w(x(t), k)| dk. Short notations a(t) = a(x(t)) are used for 
a = γ, ν, θ_D. The subscripts denote the order of application of the conditional 
probabilities. {}_1 generates a value of t, associated with a free flight time of a 
particle which drifts over a piece of Newton’s trajectory between an initial state 
(x, k) at time 0 and the final state (x(t), k(t)). The final state is used in the next 
probability density to select the value k'. The transition between k(t) and k' is 
interpreted as scattering of the particle. {}_2 is the probability to use the first 
kernel component for selection of the after-scattering value of k'. The second 
component is selected with the complementary probability {}_2' = 1 − {}_2. Thus k' 
is chosen either with the probability density {}_3 or with the probability density 
{}_3'. The normalization of the latter is ensured by the function γ. The 
after-scattering state (k', x(t), t) is the initial state of a free flight for the 
next iteration of the kernel. 

The weight factor w_i = ±(1 + γ/ν), where the sign is given by the sign of V_w, 
multiplies the weight w accumulated by the trajectory. It can be shown that the 
mean accumulated weight is evaluated hy w = where T is the dwelling 

time of the particle in the device. Since the weight w does not depend on ν, the 
latter can be chosen according to some criteria of convenience. A choice ν = γ/2 
gives rise to a weight factor w_i = ±3. 
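
Why the accumulated weight grows so quickly can be checked with a short calculation. The sketch below assumes constant γ and ν, scattering events that form a Poisson process with rate ν over the dwelling time T, and a factor (1 + γ/ν) per event, which is the value consistent with the ±3 quoted for ν = γ/2. Under these assumptions the expected magnitude of the weight sums to exp(γT) for every choice of ν. The function name and parameters are illustrative.

```python
import math


def mean_abs_weight(gamma, nu, dwell_time, n_max=400):
    """E[(1 + gamma/nu)^N] for N ~ Poisson(nu * dwell_time), summed term by term."""
    factor = 1.0 + gamma / nu
    term = math.exp(-nu * dwell_time)        # probability of zero events
    total = term
    for n in range(1, n_max + 1):
        # Poisson probability of n events times factor^n, built recursively
        term *= nu * dwell_time * factor / n
        total += term
    return total
```

Evaluating the function for several values of `nu` at fixed `gamma` and `dwell_time` gives the same result, which mirrors the statement that the mean weight does not depend on ν.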

For typical nanoscale devices 7 ~ T > so that w and thus 

the realizations of which sample ⟨A⟩ become huge positive and negative numbers. 
The sample variance is orders of magnitude larger than the sample mean, 
where positive and negative terms cancel each other. This problem, known as 
the sign problem, exists also in stochastic approaches to alternative formulations 
of quantum mechanics. For instance MC evaluations of Feynman path integrals 
demonstrate an exponential growth of the variance with the evolution time [4]. 

Due to the sign problem, the application of the derived method is restricted 
to single barrier tunneling and small barrier heights. 

4 Weight Decomposition Approach 

The following modification of the method is proposed to overcome the problem 
of the growing variance. The idea is to decompose the weight into a part which 
continues with the trajectory and a part which is left for future processing. The 
antisymmetric function V_w is decomposed into two positive functions: V_w = 
V_w^+ − V_w^−. With these functions ν can be expressed as ν = γ/2 = ∫ V_w^+ dk. The 
kernel is written as: 

$$K = \int dt \int dk'\, \left\{\nu(t)\,e^{-\int_0^t \nu(y)\,dy}\right\}_1 \theta_D(t) \left( \left\{\delta(k(t)-k')\right\}_2 + \left\{\frac{V_w^+(x(t),k(t)-k')}{\nu(t)}\right\}_2 - \left\{\frac{V_w^-(x(t),k(t)-k')}{\nu(t)}\right\}_2 \right)$$

The three kernel components simultaneously create three after-scattering states. 
Each state gives rise to a trajectory which must be simulated until it leaves the 
simulation domain. 

Two of the trajectories carry the weight of the trajectory before the scattering 
event, the third one changes its sign. Since the initial weight at the boundary is 
1, the absolute value of the weight of each trajectory remains 1. The following 
picture can be associated with the transport process. With each iteration of the 
kernel a positive (negative) particle undergoes a free flight and scattering event. 

After the scattering event the particle survives in the same state with the 
same weight due to the delta function. Additionally a positive and a negative 
particle are created by V_w^+ and V_w^−, respectively. A phase space grid 
(nΔ_x, mΔ_k), n = 1, …, N, m = −M, …, M is introduced, where all three particles 
are stored. The simulated trajectory continues from the grid cell with the highest 
number of stored particles. It is selected among all k cells with a position index 
n = int(x(t)/Δ_x), where x(t) is the location of the scattering event. 

Positive and negative particles have opposite contribution to the statistics. 
They have the same probabilistic future if located close together in the phase 
space and thus can be canceled. The active cancellation reduces the simulation 
time leading to a variance reduction. 
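
The bookkeeping behind the active cancellation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each grid cell holds a net signed count, so a positive and a negative particle stored in the same cell annihilate automatically, and "highest number of stored particles" is interpreted here as the largest magnitude of the net count. The class name, method names and cell-indexing convention are hypothetical.

```python
from collections import defaultdict


class SignedPhaseSpaceGrid:
    """Stores particles of weight +1 / -1 on a (position, wave vector) grid."""

    def __init__(self, dx, dk):
        self.dx, self.dk = dx, dk
        self.counts = defaultdict(int)       # (n, m) -> net signed count

    def cell(self, x, k):
        return (int(x // self.dx), int(round(k / self.dk)))

    def store(self, x, k, sign):
        # opposite signs in the same cell cancel in the running sum
        self.counts[self.cell(x, k)] += sign

    def next_start(self, x):
        """Pick the k cell with the largest net count at the position of x."""
        n = int(x // self.dx)
        column = {m: c for (nn, m), c in self.counts.items()
                  if nn == n and c != 0}
        if not column:
            return None
        m = max(column, key=lambda mm: abs(column[mm]))
        sign = 1 if column[m] > 0 else -1
        self.counts[(n, m)] -= sign          # remove the particle that continues
        return n, m, sign
```

A trajectory would call `store` for the three after-scattering states and then `next_start` to decide where the simulation continues.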

If phonon scattering is included the picture remains similar. The free flights 
are additionally interrupted by the phonon scattering events, which as in the 
classical case, only change the particle momentum. 

5 Results and Discussions 

The method has been applied to the simulation of a resonant tunneling diode. A 
double barrier structure from the literature [5] is used as a benchmark device. 
Physical parameters of GaAs with a uniform effective mass of 0.067 (in units of the 
free electron mass) and a temperature T = 77 K are assumed. The potential drop is 
linear across the central device, the barriers have a thickness of 2.825 nm and a 
height of 0.27 eV, and the well is 4.52 nm wide. The device is shown schematically 
in Fig. 1. 

Fig. 1. Electron concentration distribution in the central part of the resonant tunneling 
diode for three different bias points. 



Numerical results for the electron concentration, A_n(x_0) = δ(x − x_0), are 
presented for three different values of the applied bias. The values correspond 
to bias points near equilibrium, and to the peak and the valley of the 
current-voltage characteristics. The accumulation of electrons in the quantum well 
at the resonance bias of 0.13 V is well demonstrated. The physical quantity 
which gives the current is A_j = q v(k)/L_d, where q is the electron charge and 
L_d is the length of the device. The current-voltage characteristics are shown in 
Fig. 2. The peak of the current at 0.13 V is followed by a negative differential 
resistance region, which is a typical feature of resonant tunneling diodes. 
The current-voltage characteristics are in good quantitative agreement with the 
results obtained by other methods [1]. 

Fig. 3 shows the current for the chosen three bias points as a function of the 
number of scattering events. The latter quantity is an empirical measure of the 
elapsed simulation time which is independent of the computer platform. 10^10 
scattering events correspond to a 24-hour simulation on a 1 GHz CPU. 
The three curves illustrate the variations of ⟨A_j⟩ during the simulation. 
Above 8.10® scattering events the variance of the corresponding values becomes 
negligible. 

6 Conclusions 

The proposed weight decomposition method significantly improves the numer- 
ical properties of the stochastic approach to the Wigner transport equation. 
Annihilation of positive and negative weights reduces the computational effort 
of the task. The method has proven suitable for the simulation of nanoelectronic 
devices. 

Fig. 2. Current-voltage characteristics of 
the resonant tunneling diode. The peak is 
at resonance bias 0.13 V. 

Fig. 3. Electron current for the chosen 
bias points as a function of the number 
of scattering events. 

Abstract. A Monte Carlo method for calculation of the carrier mobility 
in degenerate bulk semiconductors at zero electric field is presented. 
The method is obtained as a limiting case of an existing small-signal 
approach replacing the distribution function by the Fermi-Dirac distri- 
bution which is valid at zero electric field. The general form of the Boltz- 
mann equation which takes into account the Pauli exclusion principle 
in the scattering term is used to derive the integral representation of 
a Boltzmann-like equation for a small perturbation of the distribution 
function. The method allows calculation of the whole mobility tensor in 
comparison with the one particle Monte Carlo algorithm which is tradi- 
tionally used to compute low field carrier mobility. 



1 Introduction 

The low field carrier mobility of a bulk semiconductor is an important kinetic 
property of a semiconductor. It is used to analyze the carrier transport in semi- 
conductor devices at low applied voltages and enters expressions for high field 
mobility models as an additional parameter. Thus the knowledge of the low field 
carrier mobility and its correct dependence on the material properties such as 
the doping concentration are necessary to adequately simulate carrier transport 
in semiconductor devices. 

The standard approach for obtaining the low field carrier mobility is a single 
particle Monte Carlo method. Within this method, in order to calculate the low 
field mobility along the direction of the electric field, one should carefully choose 
the magnitude of the applied electric field. On the one hand, the magnitude 
of the electric field must be as low as possible. In principle it is desirable to have 
zero electric field. However, there exist limitations related to the increase of the 
variance of standard Monte Carlo methods. On the other hand, the field 
must not be too high to avoid a mobility reduction due to carrier heating. 

In addition to these disadvantages, the standard approaches only give one 
component of the carrier mobility, namely the component along the direction of 
the electric field. For isotropic conditions it does not make any difference since 
the mobility tensor is diagonal and all diagonal values are equal. However, when 
anisotropy is present, for example in strained semiconductors, the mobility tensor 
elements may be different and several Monte Carlo simulations are required to 
obtain all the components of the tensor. 

To overcome these problems associated with standard Monte Carlo methods, 
a new Monte Carlo algorithm has been suggested recently [8], which solves 
the Boltzmann equation at exactly zero field and represents a limiting case of a 
small signal algorithm obtained in [6]. One of the most remarkable properties of 
the algorithm is the absence of self-scattering, which significantly reduces the 
calculation time and gives very good accuracy of the results. This method 
is restricted to the simulation of low doped semiconductors. The quantum mechanical 
Pauli exclusion principle is not included in the scattering term of the 
Boltzmann equation used for the derivation of the algorithm. As a result there 
are limitations on the doping level of materials analyzed by this technique. It 
gives excellent results at low and intermediate doping levels, while 
results obtained for higher doping levels, where the effects of degenerate statistics 
are more pronounced, are incorrect. As the standard Monte Carlo methods 
exhibit a very high variance, especially in the degenerate case, it is desirable to 
have a powerful technique to analyze the carrier mobility at high doping levels. 

In this work we present a zero field algorithm to account for degenerate 
statistics. The Pauli exclusion principle is taken into consideration in the scat- 
tering term of the Boltzmann equation. As a result the Boltzmann equation 
becomes nonlinear. Using this nonlinear equation we derive a generalized zero 
field algorithm applicable for the analysis of highly doped materials. 



2 Nonlinear Boltzmann Equation 
and Its Linearized Form 



Let us consider a bulk homogeneous semiconductor. Then we can neglect the 
space dependence of the distribution function and the differential scattering 
rate. We also assume the differential scattering rate to be time invariant. With 
these conditions the time dependent Boltzmann equation taking into account 
the Pauli exclusion principle takes the following form: 

$$\frac{\partial f(k,t)}{\partial t} + \frac{qE(t)}{\hbar}\cdot\nabla f(k,t) = Q[f](k,t), \qquad (1)$$



where E(t) is an electric field and q is the particle charge. Q[f](k, t) represents 
the scattering operator which is given by the following expression: 

$$Q[f](k,t) = \int f(k',t)\,[1-f(k,t)]\,S(k',k)\,dk' - \int f(k,t)\,[1-f(k',t)]\,S(k,k')\,dk', \qquad (2)$$

where S(k', k) stands for the differential scattering rate. Thus S(k, k')dk' is the 
scattering rate from a state with wave vector k to states in dk' around k', f(k, t) 
is the distribution function and the factors [1 − f(k, t)] mean that the final state 
must not be occupied according to the Pauli exclusion principle. As can be seen 
from (2), there are terms f(k, t)f(k', t) which render the equation nonlinear. 
Only when the condition f(k, t) ≪ 1 is valid can the factors [1 − f(k, t)] be 
replaced by unity, and the equation takes the usual linear form. 

To linearize (1) we write the electric field in the form: 

$$E(t) = E_s + E_1(t), \qquad (3)$$

where E_s stands for a stationary field and E_1(t) denotes a small perturbation 
which is superimposed on the stationary field. It is assumed that this small 
perturbation of the electric field causes a small perturbation of the distribution 
function which can be written as follows: 

$$f(k,t) = f_s(k) + f_1(k,t), \qquad (4)$$

where f_s(k) is a stationary distribution function and f_1(k, t) is a small deviation 
from the stationary distribution. Substituting (4) into (2), the scattering operator 
Q[f](k, t) takes the form: 

$$Q[f](k,t) = \int (f_s(k') + f_1(k',t))\,[1 - f_s(k) - f_1(k,t)]\,S(k',k)\,dk' - \int (f_s(k) + f_1(k,t))\,[1 - f_s(k') - f_1(k',t)]\,S(k,k')\,dk'. \qquad (5)$$

It should be noted that, in spite of the fact that f_1(k, t) ≪ 1, one should take 
care when linearizing terms such as 1 − f_s(k) − f_1(k, t), as especially in the 
degenerate case it may happen that 1 − f_s(k) is no longer large compared with 
f_1(k, t), because [1 − f_s(k)] → 0. 

Neglecting terms of second order we derive the zeroth order equation: 

$$\frac{q}{\hbar}E_s\cdot\nabla f_s(k) = [1-f_s(k)]\int f_s(k')\,S(k',k)\,dk' - f_s(k)\int [1-f_s(k')]\,S(k,k')\,dk', \qquad (6)$$

and the first order equation: 

$$\frac{\partial f_1(k,t)}{\partial t} + \frac{q}{\hbar}E_s\cdot\nabla f_1(k,t) = -\frac{q}{\hbar}E_1(t)\cdot\nabla f_s(k) + Q^{(1)}[f](k,t), \qquad (7)$$

where we introduced the notation Q^(1)[f](k, t) for the first order scattering 
operator, which has the form: 

$$Q^{(1)}[f](k,t) = [1-f_s(k)]\int f_1(k',t)\,S(k',k)\,dk' - f_1(k,t)\int [1-f_s(k')]\,S(k,k')\,dk' - f_1(k,t)\int f_s(k')\,S(k',k)\,dk' + f_s(k)\int f_1(k',t)\,S(k,k')\,dk'. \qquad (8)$$

Equation (7) is linear with respect to f_1(k, t). It is a kinetic equation which differs 
from the usual form of the Boltzmann equation. The first difference is related 
to the presence of the additional term on the right hand side, which is 
proportional to E_1 and additionally depends on the stationary distribution. 
The second difference is in the expression for the scattering operator, which now 
has a more complex form and also depends on the stationary distribution. 



3 Integral Form of the First Order Equation 



To construct a new Monte Carlo algorithm we reformulate the first order 
Boltzmann equation in integral form. For this purpose we introduce a new 
differential scattering rate and a new total scattering rate defined by the following 
expressions: 

$$\tilde{S}(k',k) = [1-f_s(k)]\,S(k',k) + f_s(k)\,S(k,k'), \qquad (9)$$

$$\tilde{\lambda}(k) = \int \{[1-f_s(k')]\,S(k,k') + f_s(k')\,S(k',k)\}\,dk' = \int \tilde{S}(k,k')\,dk'. \qquad (10)$$

It is worth noting that the similarity with the standard Boltzmann equation is 
only formal as both, differential scattering rate and total scattering rate, are 
now functionals of the stationary distribution function which is the solution of 
the equation of zero order (6). 
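
The definitions (9) and (10) can be tried out on a discretized wave-vector grid. The sketch below assumes that (9) reads S̃(k', k) = [1 − f_s(k)] S(k', k) + f_s(k) S(k, k') and that the integral in (10) is replaced by a sum over grid states with uniform spacing `dk`. The matrix layout `S[a][b]` (rate from state a to state b) and the function name are choices made for this example.

```python
def modified_rates(S, f_s, dk):
    """Return (S_tilde, lam_tilde) on a uniform grid.

    S[a][b] is the differential rate from state a to state b and
    f_s[b] the stationary occupation of state b.
    """
    n = len(f_s)
    # eq. (9): Pauli-blocked forward part plus occupation-weighted backward part
    S_tilde = [[(1.0 - f_s[b]) * S[a][b] + f_s[b] * S[b][a]
                for b in range(n)] for a in range(n)]
    # eq. (10): total rate as the integral of S_tilde over final states
    lam_tilde = [sum(S_tilde[a]) * dk for a in range(n)]
    return S_tilde, lam_tilde
```

For empty final states the modified rates reduce to the ordinary ones, and for fully occupied final states the roles of forward and backward scattering are exchanged.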

With these definitions the scattering operator of the first order Q^(1)[f](k, t) 
formally takes the conventional form: 

$$Q^{(1)}[f](k,t) = \int f_1(k',t)\,\tilde{S}(k',k)\,dk' - f_1(k,t)\,\tilde{\lambda}(k), \qquad (11)$$

and the Boltzmann-like equation can be rewritten as follows: 

$$\frac{\partial f_1(k,t)}{\partial t} + \frac{q}{\hbar}E_s\cdot\nabla f_1(k,t) = \int f_1(k',t)\,\tilde{S}(k',k)\,dk' - f_1(k,t)\,\tilde{\lambda}(k) - \frac{q}{\hbar}E_1(t)\cdot\nabla f_s(k). \qquad (12)$$

We derive the integral form of this equation using techniques described in [7]. 
Introducing a phase space trajectory K(t') = k − (q/ħ)E_s(t − t'), which is the 
solution of Newton’s equation, and taking into account that f_1(K(t_0), t_0) = 0 
for t_0 < 0, we obtain the following integral form: 



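$$f_1(K(t),t) = \int_0^t dt' \int dk'\, f_1(k',t')\,\tilde{S}(k',K(t'))\,\exp\left(-\int_{t'}^{t}\tilde{\lambda}[K(y)]\,dy\right) - \frac{q}{\hbar}\int_0^t E_1(t')\cdot[\nabla f_s](K(t'))\,\exp\left(-\int_{t'}^{t}\tilde{\lambda}[K(y)]\,dy\right)dt'. \qquad (13)$$

Finally we assume an impulse-like excitation of the electric field, E_1(t) = E_im δ(t), 
and obtain: 

$$f_1(K(t),t) = \int_0^t dt' \int dk'\, f_1(k',t')\,\tilde{S}(k',K(t'))\,\exp\left(-\int_{t'}^{t}\tilde{\lambda}[K(y)]\,dy\right) + G(K(0))\,\exp\left(-\int_0^t \tilde{\lambda}[K(y)]\,dy\right), \qquad (14)$$

where 

$$G(k) = -\frac{q}{\hbar}E_{im}\cdot\nabla f_s(k). \qquad (15)$$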

The essential difference of this integral representation from the one of the 
non-degenerate approach consists in the appearance of the new differential scattering 
rate S̃(k', k) and of the total scattering rate λ̃(k). Another difference, which is 
common to both approaches, is the additional free term on the right hand side, 
which in general cannot be treated as an initial distribution because it also takes 
negative values. 



4 Zero Field Approach 



When the electric field tends to zero, the distribution function approaches its 
equilibrium, which in the case of particles with half-integer spin is represented by 
the Fermi-Dirac distribution function: 

$$f_{FD}(\epsilon) = \frac{1}{\exp\left(-\frac{E_f - \epsilon}{k_B T_0}\right) + 1}, \qquad (16)$$

where E_f denotes the Fermi energy, ε stands for an electron energy and T_0 is an 
equilibrium temperature equal to the lattice temperature. Since the stationary 
distribution is known, there is no necessity to solve the zeroth order equation 
(6). As can be seen from (16), in equilibrium the distribution function depends 
only on the carrier energy and not on the wave vector. This fact allows one to 
significantly simplify (10) using the Fermi golden rule [2]: 

S{k, k) = - e{k) ± Z\e]. (17) 

Making use of the delta function in the last expression we can rewrite (10) in 
the following manner: 

$$\tilde{\lambda}(k) = [1 - f_{FD}(\epsilon_f)]\,\lambda(k) + f_{FD}(\epsilon_f)\,\lambda^*(k), \qquad (18)$$

where ε_f denotes the final carrier energy and we introduced the backward 
scattering rate S*(k, k') = S(k', k) and λ*(k) = ∫ S*(k, k') dk'. Equation (18) 
represents a linear combination of the forward and backward total scattering 
rates. In the non-degenerate case, f_FD(ε) ≪ 1, we obtain λ̃(k) = λ(k), which means 
that scattering processes are mostly determined by the forward scattering rate and 
thus the algorithm developed in [8] for nondegenerate statistics is restored. On 
the other hand, for highly doped semiconductors, f_FD(ε) ≈ 1, scattering 
processes are dominantly backward, λ̃(k) = λ*(k). In the case of intermediate doping 
levels both forward and backward scattering contribute to the kinetics. It should 
also be mentioned that, as at high doping levels the backward scattering rate is 
dominant, the probability to scatter to higher energy states is larger than to 
lower energy states, as schematically shown in Fig. 1(a). This means that lower 
energy levels are already occupied by particles, i.e. f_FD(ε) ≈ 1 (see Fig. 1(b)) 
and, due to the Pauli exclusion principle, scattering to these energy levels is 
quantum mechanically forbidden. 

Fig. 1. Schematic illustration of the scattering processes at high degeneracy. 
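
The two limits of the combined rate can be checked numerically. The sketch below assumes that (18) reads λ̃(k) = [1 − f_FD(ε_f)] λ(k) + f_FD(ε_f) λ*(k) with the standard Fermi-Dirac occupation; the forward and backward rates are passed in as plain numbers, and all names and values are illustrative.

```python
import math


def fermi_dirac(energy, fermi_energy, kT):
    """Equilibrium occupation of a state with the given energy."""
    return 1.0 / (math.exp((energy - fermi_energy) / kT) + 1.0)


def combined_rate(lam_forward, lam_backward, final_energy, fermi_energy, kT):
    """Linear combination of forward and backward total scattering rates."""
    f = fermi_dirac(final_energy, fermi_energy, kT)
    return (1.0 - f) * lam_forward + f * lam_backward
```

For a final energy far above the Fermi level the function returns essentially the forward rate, far below it essentially the backward rate, and at the Fermi level the arithmetic mean of the two.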

The additional free term in (14) cannot be considered as an initial distribution 
because the function G(k) may take negative values. However, in the case of zero 
electric field the stationary distribution is known analytically and we can calculate 
G(k) explicitly: 

$$G(k) = \frac{q}{k_B T_0}\,E_{im}\cdot v\,\frac{\exp\left(-\frac{E_f-\epsilon}{k_B T_0}\right)}{\left[\exp\left(-\frac{E_f-\epsilon}{k_B T_0}\right) + 1\right]^2}, \qquad (19)$$

where v denotes the group velocity. This expression can be rewritten in the 
following manner: 

$$G(k) = \frac{qE_{im}\langle\tilde{\lambda}\rangle}{k_B T_0}\cdot\frac{v(k)\,[1 - f_{FD}(\epsilon)]}{\tilde{\lambda}(k)}\left\{\frac{\tilde{\lambda}(k)\,f_{FD}(\epsilon)}{\langle\tilde{\lambda}\rangle}\right\}, \qquad (20)$$

where the term in curly brackets represents the normalized distribution function 
of the before-scattering states. The Monte Carlo algorithm contains the same 
steps as those in [8] except that the whole kinetics must now be considered 
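in terms of S̃(k, k') and λ̃(k) instead of S(k, k') and λ(k). This may be seen by 
deriving the second iteration term of the Neumann series for (14) using the 
forward formulation to obtain the ensemble Monte Carlo algorithm: 

$$\langle f\rangle_\Omega^{(2)}(t) = \int_0^t dt_2 \int_0^{t_2} dt_1 \int dk_2^a \int dk_1^a \int dk_i\, \{G(k_i)\} \left\{\exp\left(-\int_0^{t_1}\tilde{\lambda}[K_1(y)]\,dy\right)\tilde{\lambda}[K_1(t_1)]\right\} \left\{\frac{\tilde{S}[K_1(t_1),k_1^a]}{\tilde{\lambda}[K_1(t_1)]}\right\} \left\{\exp\left(-\int_{t_1}^{t_2}\tilde{\lambda}[K_2(y)]\,dy\right)\tilde{\lambda}[K_2(t_2)]\right\} \left\{\frac{\tilde{S}[K_2(t_2),k_2^a]}{\tilde{\lambda}[K_2(t_2)]}\right\} \exp\left(-\int_{t_2}^{t}\tilde{\lambda}[K(y)]\,dy\right)\theta_\Omega(K(t)), \qquad (21)$$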



where k^a stands for an after-scattering wave vector and k_i denotes an initial wave 
vector. The quantity S̃(k, k')/λ̃(k) represents a normalized after-scattering 
distribution. As can be seen from (9) and (10), it is normalized to unity. As follows 
from (21), during the Monte Carlo simulation a particle trajectory is constructed in 
terms of the new quantities S̃ and λ̃. 

Another difference from the non-degenerate zero field algorithm is that the 
weight coefficient v(k)/λ̃(k) must be multiplied by the factor [1 − f_FD(ε)]. 

With these modifications the steps of the algorithm are as follows: 

1. Set ν = 0, w = 0. 

2. Select initial state k arbitrarily. 

3. Compute a sum of weights: w = w + [1 − f_FD(ε)][v_j(k)/λ̃(k)]. 

4. Select a free-flight time t_f = −ln(r)/λ̃(k) and add time integral to estimator: 
   ν = ν + w v_i t_f, or use the expected value of the time integral: 
   ν = ν + w[v_i/λ̃(k)]. 

5. Perform scattering. If mechanism was isotropic, reset weight: w = 0. 

6. Continue with step 3 until N k-points have been generated. 

7. Calculate component of zero field mobility tensor as μ_ij = q⟨λ̃⟩ν/(k_B T_0 N). 
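
The steps above can be sketched as a short driver routine. This is a minimal illustration under stated assumptions rather than the authors' implementation: the physical model is supplied through hypothetical callables, step 4 uses the expected-value variant, and the accompanying toy model (constant total rate, isotropic scattering that redraws the wave vector from a thermal Gaussian, units with ħ = 1) is chosen only because its mobility in the non-degenerate limit should approach the Drude value q/(m λ_0).

```python
import math
import random


def fermi_dirac(energy, fermi_energy, kT):
    """Equilibrium occupation f_FD(energy)."""
    return 1.0 / (math.exp((energy - fermi_energy) / kT) + 1.0)


def zero_field_mobility(velocity, rate, occupancy, scatter, k0, n_points,
                        q, kT, mean_rate, i=0, j=0, seed=0):
    """Follow steps 1-7 for user-supplied (hypothetical) model callables.

    velocity(k)     -> sequence of group-velocity components
    rate(k)         -> total rate used for the free flight
    occupancy(k)    -> f_FD at the energy of state k
    scatter(k, rng) -> (new_k, was_isotropic)
    """
    rng = random.Random(seed)
    nu = 0.0                      # step 1: estimator
    w = 0.0                       # step 1: accumulated weight
    k = k0                        # step 2: arbitrary initial state
    for _ in range(n_points):     # step 6: repeat for N k-points
        v = velocity(k)
        lam = rate(k)
        w += (1.0 - occupancy(k)) * v[j] / lam    # step 3
        nu += w * v[i] / lam                      # step 4, expected value
        k, was_isotropic = scatter(k, rng)        # step 5
        if was_isotropic:
            w = 0.0
    return q * mean_rate * nu / (kT * n_points)   # step 7


def toy_model(lam0=2.0, kT=1.0, mass=1.0, fermi_energy=-20.0):
    """Assumed toy model: constant rate, thermalizing isotropic scattering."""
    sigma_k = math.sqrt(mass * kT)     # thermal spread of each k component

    def velocity(k):
        return tuple(c / mass for c in k)

    def rate(k):
        return lam0

    def occupancy(k):
        energy = sum(c * c for c in k) / (2.0 * mass)
        return fermi_dirac(energy, fermi_energy, kT)

    def scatter(k, rng):
        return tuple(rng.gauss(0.0, sigma_k) for _ in range(3)), True

    return velocity, rate, occupancy, scatter
```

Calling `zero_field_mobility` with different index pairs `(i, j)` on the same model yields the individual tensor components, which is the property that distinguishes the zero field approach from a single-particle simulation along one field direction.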



5 Results and Discussion 



As the first example we calculate the doping dependence of the zero field mobility 
in silicon. The analytical band structure reported in [3] is adopted. The scat- 
tering processes included are acoustic deformation potential scattering, ionized 
impurity scattering [5] and plasmon scattering [1]. Ionized impurity scattering 
is treated as an isotropic process [4] which effectively reduces small-angle scat- 
tering. 

Fig. 2 shows two electron mobility curves obtained by the new zero field 
algorithm and the one particle Monte Carlo algorithm, respectively. The Pauli 
exclusion principle at low electric field has been included in the one particle 
Monte Carlo method using the Fermi-Dirac distribution. This leads to a prefactor 
(1 − f_FD(ε_f)) for all scattering processes within the low field approach. The 
value of the electric field used for the one particle approach has been chosen to 
be 1 kV/cm. The other physical parameters of both algorithms are the same. 

Fig. 2. Electron zero field and low field mobilities in Si. 



As can be seen from Fig. 2, the curve obtained with the one particle Monte 
Carlo method has some variance at low and high doping levels while the zero 
field curve appears rather smooth. Moreover, the calculation time for the new 
zero field algorithm is about 20% of that of the one particle Monte Carlo method. 

Fig. 3 shows the electron mobility as a function of doping concentration for 
GaAs obtained by both algorithms. The degeneracy effects are more pronounced 
in this semiconductor because of the smaller effective mass of electrons in the 
Γ valley. The absolute value of the electric field is the same as for silicon. As is 
seen from this figure, in addition to a high variance at low doping levels the one 
particle Monte Carlo gives an incorrect behavior of the mobility at high doping 
levels. This is related to the fact that the value 1 kV/cm of the electric field is 
still high from the viewpoint of using the Fermi-Dirac distribution with lattice 
temperature within the one particle algorithm. In order to obtain correct results 
at high doping levels by the one particle Monte Carlo method it is necessary 
to reduce further the magnitude of the electric field. However in this case the 
variance would increase considerably leading to an extremely long computation 
time. 



6 Conclusion 

A zero field Monte Carlo algorithm accounting for the quantum mechanical 
Pauli exclusion principle has been presented. The method has been derived from 
the integral representation of a linearized Boltzmann- like equation. It has been 
shown that particle trajectories are constructed in terms of a new scattering rate 
which in general represents a linear combination of the forward and backward 
scattering rates. It has also been pointed out that for energies below the Fermi 
level kinetic properties are predominantly determined by the backward scattering 
rate while for energy levels above the Fermi level the forward scattering rate is 
dominant. In the latter case the non-degenerate zero field algorithm is recovered. 

Fig. 3. Electron zero field and low field mobilities in GaAs. 

Abstract. Modeling the dynamics of infectious diseases, illicit drug initiation, 
or other systems, where the age of an individual or the duration 
of being in a specific state is essential, leads to a generalized form of 
the McKendrick equations, i.e., a system of first-order partial differen- 
tial equations in the time-age space. Typically such models include age- 
specific feedback components which are represented by integral terms 
in the right hand side of the equation. This paper presents the general 
framework of such optimal control models, the corresponding necessary 
optimality conditions and two different approaches for numerical solution 
methods. 

1 Introduction 

The classical McKendrick equation is the basic instrument for modeling 
age-structured population dynamics (see [1, 5-7]). It is a first-order partial 
differential equation of the form 

$$\frac{\partial S(t,a)}{\partial t} + \frac{\partial S(t,a)}{\partial a} = -\mu(a)\,S(t,a), \qquad (1)$$

where t denotes the time, a is the age, S(t, a) is the state variable (e.g., the 
number of people of age a at time t), and μ(a) is an age-dependent death rate. 
This equation can be solved easily by transforming it into a set of ordinary 
differential equations along the characteristics t − a = c 

$$\frac{dS_c(t)}{dt} = -\mu(t-c)\,S_c(t), \qquad (2)$$

where c represents the time of birth and S_c(t) = S(t, t − c). 
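
The reduction to ordinary differential equations along characteristics can be made concrete with a short sketch. It assumes the classical form in which a cohort born at time c decays as dS_c/dt = −μ(t − c) S_c, so that following a cohort from age `age0` for a time `t` only requires the integral of the death rate over the ages passed through. The trapezoidal rule, the function name and the rates used are choices made for this example.

```python
import math


def survive_along_characteristic(s0, age0, t, mu, n_steps=1000):
    """Return S(t, age0 + t) for a cohort with S(0, age0) = s0.

    The death rate mu(a) is integrated along the characteristic
    with the trapezoidal rule.
    """
    h = t / n_steps
    integral = 0.0
    for step in range(n_steps):
        a_left = age0 + step * h
        integral += 0.5 * h * (mu(a_left) + mu(a_left + h))
    return s0 * math.exp(-integral)
```

With a constant death rate of 0.1 per year, a cohort of 1000 individuals aged 30 is reduced by the factor exp(−1) after ten years.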

When modeling concrete problems, it is often the case that μ(a) depends also 
on aggregated state variables (e.g., the total population size). This fact leads to 
an integral term (integration w.r.t. age) in the right hand side of equation (1). 
This integral term couples the infinite set of ODEs in (2). 

* This work was partly financed by the Austrian Science Foundation under contract 
No. 14060-OEK and by the Austrian Science and Research Liaison Office Sofia. 

Furthermore, if optimal control models with distributed (age-specific) control 
variables are considered, an analytical solution is hardly feasible. But it is 
possible to formulate the necessary optimality conditions and, based on them, 
two different approaches for solving those problems numerically are presented in 
this paper. These two approaches are the following: 

Direct Method: This method is based on the discretization of the age parameter. 
It leads to an ordinary boundary value problem which can be solved 
by a collocation method. 

Gradient Method: Another possibility is to apply an iterative method. Based 
on an initial estimate of the control variable the primal system is solved 
in the forward direction and, using the results, the dual system is solved in 
the backward direction. The approximate solutions of the primal and the 
dual systems allow the calculation of the anti-gradient of the Hamiltonian, which 
is used to improve the approximation of the control. 
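
The forward-backward structure of the gradient method can be illustrated on a deliberately simple problem. The sketch below is not the age-structured system of this paper: it assumes a scalar toy problem, minimizing the integral of x² + u² over [0, T] subject to x' = u, x(0) = 1, discretized with the explicit Euler scheme and updated with a fixed step size. The Hamiltonian H = x² + u² + λu, the function name and all parameters are assumptions of this example.

```python
def forward_backward_sweep(T=1.0, n=200, step=0.3, n_iter=100):
    """Gradient iteration for a toy optimal control problem."""
    dt = T / n
    u = [0.0] * n                      # initial estimate of the control
    history = []                       # objective value per iteration
    for _ in range(n_iter):
        # primal system x' = u, x(0) = 1, solved in the forward direction
        x = [1.0]
        for i in range(n):
            x.append(x[-1] + dt * u[i])
        history.append(sum((x[i] ** 2 + u[i] ** 2) * dt for i in range(n)))
        # dual system lam' = -2 x with zero terminal value, solved backward
        lam = [0.0] * n
        for i in range(n - 2, -1, -1):
            lam[i] = lam[i + 1] + dt * 2.0 * x[i + 1]
        # anti-gradient of the Hamiltonian with respect to the control
        grad = [2.0 * u[i] + lam[i] for i in range(n)]
        u = [u[i] - step * grad[i] for i in range(n)]
    return u, history
```

The recorded objective values decrease from iteration to iteration; in the age-structured setting the same loop applies, with the forward and backward solves replaced by integration of the primal and dual systems along characteristics.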

2 Model Framework 

We consider a system S = (S_1, …, S_n) of different states S_i(t, x) with i = 
1, …, n depending on the time t and a second parameter x, which can either be 
the age or the duration in this state. In both cases x has the form x = t − c, where 
c indicates the time of birth (in the age case) or the time of state entrance (in 
the duration case). To distinguish between an age-specific and a duration-specific 
system, we introduce the age-duration-index 

$$\mathrm{ADI} = \begin{cases} 1 & \text{age-specific case} \\ 0 & \text{duration-specific case} \end{cases} \qquad (3)$$

There are different parameters describing the transition between the states, 
and into and out of the system: 

μ_ij = μ_ij(t, x, S, F, u) denotes the transition from state S_i to S_j depending on 
time t, the parameter x, the state of the system S, the overall feedback F of 
the system, and the distributed control u(t, x). 
α_i = α_i(t, x, S, F, u) represents the outflow from the state S_i out of the system. 
β_i = β_i(t, x, S, F, u) is the inflow into the state S_i from outside of the system. 

The dynamics of the system are represented by a McKendrick-like equation 





( 4 ) 



J 



with boundary conditions 




(3^{t, X, S, F, u) -h (1 - ADI) ^ ^iji{t, X, S, F, u) (5) 



J 
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and initial conditions 

S,{0,x) = S^{x). (6) 

The difference between an age- and a duration-specific system is the way of 
entering a new state. In the age case the transitions into a new state are included 
in the partial differential equation (4), because the age is a preserved property. 
But in the duration case the transitions occur in the boundary condition (5), 
because the duration has to be reset, if the state changes. 

The feedback component F describes the current status of the system with 
respect to a certain aspect. The general form is: 

$$F(t,x,S) = \int_0^\omega f\big(t, x, S(t,x'), u(t,x')\big)\,dx' \qquad (7)$$

As indicated, F is a vector function, so that it is possible to include different 
types of feedback. A simple example is that the feedback function consists only 
of the total amount in each state at time t (so that f = S). 

The restriction $x \in [0,\omega]$ is chosen to ease the numerical implementation
and the analysis. Furthermore, in real applications an infinite age or duration
range is very unlikely (e.g., epidemiological models are often described using
an infinite age range, but cutting off the age range at 130 years for humans is
no restriction).

The control $u(t,x)$ shall be optimized with respect to an objective function
of the general form

$$\min_{u \in U} J = \int_0^\omega l\big(x, S(T,x)\big)\,dx + \int_0^T\!\!\int_0^\omega L(t,x,S,F,u)\,dx\,dt, \qquad (8)$$

where $U$ is the set of feasible solutions.



3 Necessary Optimality Conditions 

In [4] a maximum principle is described for a more general class of models than
the one considered here. Applying it leads to the following maximum principle
(to increase readability, the arguments of functions are omitted; all missing
arguments are evaluated along the optimal solution):

Maximum Principle 1 If $\hat u(t,x)$ is the optimal control for the dynamic system
described through (3)-(8), then $S(t,x)$ defined through (3)-(7), and the adjoint
variables $\xi(t,x)$ and $\zeta(t,x)$ defined through

[Equations (9) and (10): adjoint equations for $\xi_i$ and $\zeta_i$, not legible in the source.]

with boundary and terminal conditions

$$\xi_i(T,x) = \frac{\partial l}{\partial S_i}\big(x, S(T,x)\big), \qquad \xi_i(t,\omega) = 0 \qquad (11)$$

together with $\hat u(t,x)$ satisfy

$$H(S,\xi,u) - H(S,\xi,\hat u) \ge 0 \quad \forall u \in U \qquad (12)$$

where the Hamiltonian is

$$H = L + \sum_i \xi_i \Big( -\sum_j \mu_{ij} - \alpha_i + \mathrm{ADI} \sum_j \mu_{ji} \Big)
 + \int_0^\omega \sum_i \zeta_i(t,x')\, f_i(t,x',x)\,dx'
 + \sum_k \xi_k(t,0) \Big( \beta_k + (1 - \mathrm{ADI}) \sum_j \mu_{jk} \Big). \qquad (13)$$



This maximum principle can be derived easily by applying Theorem 1
in Section 4 of [4].



4 Numerical Solution Algorithms 

The model equations (3)-(8) together with the dynamics of the adjoint system
(9)-(13) form a system of first-order partial differential equations defined on a
rectangle in the $t$-$x$ space. The primal system has initial and boundary
conditions at $t = 0$ and $x = 0$, whereas the dual system must fulfill terminal and
boundary conditions at $t = T$ and $x = \omega$. Furthermore, the primal and the
dual system have conflicting dynamics in the sense that the (numerically) stable
direction (increasing time) for solving the primal system is unstable for the
dual system. In the following, two different methods are described which take
these circumstances into account.
Direct Method



In this context, direct method means that an approximation of the solution of
the problem is calculated in one step. The main idea is to transform the system
of partial differential equations into a system of ordinary differential equations
by applying the method of lines. In detail, the above equations are discretized
with respect to age (duration) $x$ by introducing an appropriate grid
$(x_k)_{k=1,\dots,N_k}$. For each grid point a separate state vector as a function
of time is specified. The derivatives w.r.t. $x$ are approximated using a finite
difference formula of appropriate order, and the integration w.r.t. $x$ is replaced
by a quadrature formula (for details see [8]). At each grid point $x_k$ the function
$S_i^k(t)$ describes the state $i$ at this point. The state vector $S(t,x)$ extends
to a state matrix, with one element for each state and each age:

$$S(t) = \begin{pmatrix} S_1^1(t) & S_1^2(t) & \cdots \\ S_2^1(t) & S_2^2(t) & \\ \vdots & & \ddots \end{pmatrix}$$



Accordingly, all functions are transformed into time-dependent grid functions.
The ODE system corresponding to (4) can be formulated as follows:

$$\frac{dS_i^k(t)}{dt} = -\Delta_x\big(S_i^k(t)\big) - \sum_j \mu_{ij}(t,x_k,S,F,u) - \alpha_i(t,x_k,S,F,u) + \mathrm{ADI} \sum_j \mu_{ji}(t,x_k,S,F,u) \qquad (14)$$

with boundary conditions

$$S_i^1(t) = \Omega_x\Big( \beta_i(t,x_k,S,F,u) + (1 - \mathrm{ADI}) \sum_j \mu_{ji}(t,x_k,S,F,u) \Big) \qquad (15)$$

and initial conditions

$$S_i^k(0) = S_i^0(x_k), \qquad (16)$$

where $\Delta_x$ denotes an appropriate finite difference operator w.r.t. $x$ and
$\Omega_x$ is the quadrature operator. In the same way the feedback function $F$,
the objective function $J$, and the adjoint system can be transformed. This
procedure leads to an ordinary boundary value problem, where one half of the
state variables has initial conditions and the other half has terminal
conditions. The number of equations depends on the number of grid points chosen.
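
The method-of-lines step described above can be illustrated on a much smaller problem than the paper's. The following is a minimal sketch, assuming a single state with no transitions, a constant inflow at the boundary ($S(t,0) = \beta$) and a linear outflow $\alpha_i = \alpha S$; the first-order upwind difference and the explicit Euler time stepping are choices made here for brevity, not the authors' discretization, and the names `mol_rhs` and `solve_primal` are hypothetical.

```python
import numpy as np

def mol_rhs(S, dx, alpha):
    """Semi-discrete right-hand side dS_k/dt for the grid points k >= 1."""
    dSdt = np.zeros_like(S)
    # first-order upwind difference for dS/dx, plus the linear outflow term
    dSdt[1:] = -(S[1:] - S[:-1]) / dx - alpha * S[1:]
    return dSdt  # dSdt[0] stays 0: the boundary value is imposed separately

def solve_primal(S0, beta, alpha, dx, dt, n_steps):
    """Explicit Euler in time; S[0] is fixed by the boundary condition S(t, 0) = beta."""
    S = np.array(S0, dtype=float)
    S[0] = beta
    for _ in range(n_steps):
        S = S + dt * mol_rhs(S, dx, alpha)
    return S

# Toy run on x in [0, 1]: the solution approaches the steady state beta * exp(-alpha * x).
x = np.linspace(0.0, 1.0, 101)
S_end = solve_primal(np.zeros_like(x), beta=2.0, alpha=1.0,
                     dx=x[1] - x[0], dt=0.005, n_steps=1000)
```

Each grid point carries its own time-dependent unknown, which is the structure the state matrix above expresses; with several states the array `S` would gain a second index.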

As explained above, the primal and the dual system have complementary
stability behavior. Therefore, it is not possible to apply a shooting method to
solve the problem. A finite difference or a collocation method can be used
instead to solve the boundary value problem, since these methods solve the
problem over the whole range in one step (see [2]). This is done by discretizing
the problem, which yields a nonlinear equation system. Unfortunately, this
nonlinear equation system can become extremely large. Its size depends on the
number of states, on the number of grid points $N_k$, and on the discretization of
the time $t$. Hence, this method can only be applied to small and medium-sized
problems.

Gradient Method 

Another approach for a solution algorithm is to approximate the optimal control 
variable u(t, x) iteratively using the anti-gradient of the Hamiltonian function. 
The steps of this iteration are as follows: 

1. Choose a first estimate $u^{(1)}$ for the control. Set $k = 1$.

2. Get the solution $S^{(k)}$ of the primal system using $u^{(k)}$. For solving
   the primal system, the equations are transformed into an ordinary differential
   equation system either by discretizing $x$ or by reformulating the problem
   along the characteristics (see equation (2)).

3. Get the solution $\xi^{(k)}$ of the dual system using $u^{(k)}$ and $S^{(k)}$.
   The same method as in the previous step can be applied here, but the equations
   must be solved backwards (due to the terminal conditions).

4. Now a direction for improving the control variable can be calculated using
   the anti-gradient of the Hamiltonian, $-\partial H/\partial u$. If the
   (anti-)gradient is sufficiently small, the iteration stops and $u^{(k)}$ is
   the approximate solution.

5. The new estimate $u^{(k+1)}$ can be calculated by performing a line search and
   minimizing the objective $J$ along the line
   $u^{(k)} + \lambda\,(-\partial H/\partial u)$, $\lambda \ge 0$. For this step it
   is again necessary to solve the primal system for different control values $u$.

6. Change $k$ to $k + 1$ and proceed with step 2.

For the gradient method it is only necessary to solve initial value problems, be- 
cause the adjoint system is solved in the backwards direction and the terminal 
conditions turn into initial conditions. The effort for solving initial value prob- 
lems is far less than solving boundary value problems, especially if the problems 
are large. Therefore, this method is preferable for models with many states and 
a long time horizon. 
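
The six steps can be sketched on a toy problem that is far simpler than the age-structured model. The example below assumes a scalar ODE $s' = u$, $s(0) = 1$, with cost $J = \int_0^T (s^2 + u^2)\,dt$, so that $H = s^2 + u^2 + \xi u$ and the adjoint satisfies $\xi' = -2s$, $\xi(T) = 0$. The explicit Euler discretization, the step-halving line search and all function names are assumptions made for this illustration only.

```python
import numpy as np

def solve_primal(u, s0, dt):
    """Forward sweep: s_{k+1} = s_k + dt * u_k (explicit Euler)."""
    s = np.empty(len(u) + 1)
    s[0] = s0
    for k in range(len(u)):
        s[k + 1] = s[k] + dt * u[k]
    return s

def solve_dual(s, dt):
    """Backward sweep for the adjoint: xi_N = 0, xi_k = xi_{k+1} + 2*dt*s_k."""
    xi = np.zeros(len(s))
    for k in range(len(s) - 2, -1, -1):
        xi[k] = xi[k + 1] + 2.0 * dt * s[k]
    return xi

def objective(u, s0, dt):
    s = solve_primal(u, s0, dt)
    return dt * np.sum(s[:-1] ** 2 + u ** 2)

def gradient_method(s0=1.0, T=1.0, n=200, tol=1e-6, max_iter=200):
    dt = T / n
    u = np.zeros(n)                          # step 1: first estimate of the control
    for _ in range(max_iter):
        s = solve_primal(u, s0, dt)          # step 2: primal system, forwards
        xi = solve_dual(s, dt)               # step 3: dual system, backwards
        g = -(2.0 * u + xi[1:])              # step 4: anti-gradient of H w.r.t. u
        if np.max(np.abs(g)) < tol:
            break
        lam, J_old = 1.0, objective(u, s0, dt)
        # step 5: crude line search, halve lam until the objective decreases
        while objective(u + lam * g, s0, dt) >= J_old and lam > 1e-12:
            lam *= 0.5
        u = u + lam * g                      # step 6: next iterate
    return u, objective(u, s0, dt)

u_opt, J_opt = gradient_method()
```

For this toy problem the continuous optimum is known in closed form ($J^* = \tanh T$, $u^*(0) = -\tanh T$), which gives a simple sanity check on the sweep directions and signs. Only initial value problems are solved, as noted above.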

5 Discussion and Conclusions

Both methods were implemented and several different models were solved (see 
[1,3]). In general the gradient method seems preferable, because it is faster and 
less memory-consuming. Its usual disadvantages are that the iteration can easily
be caught in a local optimum and that the convergence speed depends strongly
on the problem and can be very low.

On the other hand, the direct method is more time-consuming, but usually
more robust concerning the convergence of the collocation algorithm.
Discontinuities of functions or high variations need special handling, which
means that appropriate fixed grid points have to be chosen in order to get
convergence.

A combination of both algorithms may improve the results, if the direct
method is used to calculate a quick first approximation and the gradient method
is used for the fine tuning. We did not, however, test that possibility.

Since a very general formulation of the model is used, a wide range of ap- 
plications can be realized. Also further extensions of the model framework are 
possible and partly already included in our programs: 

Different Types of Feedback: It is possible to implement a simple feedback 
of the state values (i.e., the transition functions depend on S{t, x)), an aggre- 
gated but age-independent feedback (i.e., using a function F(t) which does 
not depend on x), or an aggregated age-dependent feedback F(t, x). All three 
types can be used in one model. 

Different Types of Control: The model can include distributed, boundary,
and non-distributed controls. With different types of controls, we also get
different Hamiltonian functions, and it is only a question of calculating the
controls (for the direct method) or the anti-gradients (for the gradient method)
based on those functions.

The solution methods presented in this paper are formulated for a very general 
case. For concrete problems it is useful and sometimes necessary to adapt the 
methods and consider the special structure of the problem in order to improve 
the quality of the results and to reduce the computational time. 
Abstract. Rational data fitting has proved extremely useful in a number
of scientific applications. We refer among others to its use in some
network problems [6, 7, 15, 16], to the modelling of electro-magnetic
components [20, 13], to model reduction of linear shift-invariant systems
[2, 3, 8] and so on.

When computing a rational interpolant in one variable, all existing tech- 
niques deliver the same rational function, because all rational functions 
that satisfy the interpolation conditions reduce to the same unique irre- 
ducible form. When switching from one to many variables, the situation 
is entirely different. Not only does one have a large choice of multivari- 
ate rational functions, but moreover, different algorithms yield different 
rational interpolants and apply to different situations. 

The rational interpolation of function values that are given at a set of
points lying on a multidimensional grid has been dealt with extensively
in [11, 10, 5]. The case where the interpolation data are scattered in the
multivariate space is far less discussed and is the subject of this paper.
We present a fast solver for the linear block Cauchy-Vandermonde system
that translates the interpolation conditions, and combine it with an
interval arithmetic verification step.



1 Introduction 

In one variable the rational interpolation problem can be solved using the recur- 
sive technique developed by Bulirsch-Stoer, or a fast solver for the solution of the 
structured linear system defining the coefficients. The rational interpolant can 
also be obtained as the convergent of a Thiele interpolating continued fraction 
with inverse differences in the partial denominators. For multivariate generaliza- 
tions of the Thiele interpolating continued fraction approach we refer to [21,22, 
19, 12]. For a multivariate generalization of the Bulirsch-Stoer algorithm we refer 
to [9]. The structured linear system solver, which is developed here, delivers the 
coefficients of the rational interpolant, and hence an explicit representation of 
the rational function. 

Let the value of the univariate function $f(x)$ be given in the interpolation
points $\{x_0, x_1, x_2, \dots\}$, which are non-coinciding. The rational
interpolation problem of order $(n, m)$ for $f$ consists in finding polynomials

$$p(x) = \sum_{i=0}^{n} a_i x^i, \qquad q(x) = \sum_{i=0}^{m} b_i x^i,$$

with $p(x)/q(x)$ irreducible and such that

$$f(x_i) = \frac{p}{q}(x_i), \qquad i = 0, \dots, n+m. \qquad (1)$$

In order to solve (1) we rewrite it as

$$f(x_i)\,q(x_i) - p(x_i) = 0, \qquad i = 0, \dots, n+m. \qquad (2)$$

Condition (2) is a homogeneous system of $n+m+1$ linear equations in the
$n+m+2$ unknown coefficients $a_i$ and $b_i$ of $p$ and $q$, and hence it has at
least one nontrivial solution. It is well known that all the solutions of (2)
have the same irreducible form, and we shall therefore denote by

$$r_{n,m}(x) = \frac{p^*}{q^*}(x)$$

the irreducible form of $p/q$ with $p$ and $q$ satisfying (2), where $q^*$ is
normalized according to a chosen normalization. We say that $r_{n,m}$
"interpolates" the given function, and by this we mean that $p^*$ and $q^*$ satisfy
some of the interpolation conditions (1). This does not imply that $r_{n,m}$
actually interpolates the given function at all the data because, by constructing
the irreducible form, a common factor and hence some interpolation conditions
may be cancelled in the polynomials $p$ and $q$ that provide $r_{n,m}$. Since
$r_{n,m}$ is the irreducible form, the rational functions $p/q$ with $p$ and $q$
satisfying (2) are called "equivalent". If the rank of the linear system (2) is
maximal, then $r_{n,m} = p^*/q^* = p/q$.

Let us take a closer look at the linear system of equations (2), defining the
numerator and denominator coefficients $a_i$ and $b_i$. In the sequel we assume,
for simplicity but without loss of generality, that this $(n+m+1) \times (n+m+2)$
homogeneous linear system of equations can be solved for the choice $b_0 = 1$.
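
As a small illustration of system (2) with the normalization $b_0 = 1$, the following sketch builds the square inhomogeneous system with a dense solver and recovers a known rational function. This is not the fast structured solver discussed in this paper; the function name `rational_interpolant` and the test function $(1+2x)/(1+3x)$ are assumptions chosen for the example.

```python
import numpy as np

def rational_interpolant(x, f, n, m):
    """Solve f(x_i) q(x_i) - p(x_i) = 0 with b_0 = 1 for the coefficients of p and q."""
    x = np.asarray(x, dtype=float)
    f = np.asarray(f, dtype=float)
    assert len(x) == n + m + 1
    P = np.vander(x, n + 1, increasing=True)                        # columns x^0 .. x^n
    Qm = f[:, None] * np.vander(x, m + 1, increasing=True)[:, 1:]   # columns f*x^1 .. f*x^m
    M = np.hstack([Qm, -P])              # unknowns ordered as (b_1..b_m, a_0..a_n)
    sol = np.linalg.solve(M, -f)         # b_0 = 1 moves f(x_i) to the right-hand side
    b = np.concatenate([[1.0], sol[:m]])
    a = sol[m:]
    return a, b

# Data sampled from (1 + 2x) / (1 + 3x): order (n, m) = (1, 1), three points.
xs = np.array([0.0, 0.5, 1.0])
a, b = rational_interpolant(xs, (1 + 2 * xs) / (1 + 3 * xs), n=1, m=1)
```

The column ordering (denominator block first, numerator block second) follows the layout of the Cauchy-Vandermonde matrix used in the displacement-rank discussion.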

The concept of displacement rank was first introduced in [18]. We use the
definition given in [14] where the displacement rank $\alpha$ of an
$(n+m+1) \times (n+m+1)$ matrix $A$ is defined as the rank of the matrix
$LA - AR$ with $L$ and $R$ being so-called left and right displacement operators.
If $A$ is a Cauchy-Vandermonde matrix, as in (2) after choosing $b_0 = 1$, and if
all $x_i \ne 0$ and all $|x_i| \ne 1$, then suitable displacement operators are
given by $L = \mathrm{diag}(1/x_i)_{i=0}^{n+m}$ and
$R^T = Z_{1,m} \oplus Z_{1,n+1}$ with

$$Z_{w,k} = \begin{pmatrix} 0 & \cdots & 0 & w \\ 1 & 0 & \cdots & 0 \\ & \ddots & \ddots & \vdots \\ 0 & \cdots & 1 & 0 \end{pmatrix}_{k \times k}$$
The resulting matrix $LA - AR$ then takes the form

$$LA - AR = \begin{pmatrix}
f_0 (1 - x_0^m) & 0 \cdots 0 & -(1/x_0 - x_0^n) & 0 \cdots 0 \\
\vdots & & \vdots & \\
f_{n+m} (1 - x_{n+m}^m) & 0 \cdots 0 & -(1/x_{n+m} - x_{n+m}^n) & 0 \cdots 0
\end{pmatrix}$$

Hence the displacement rank $\alpha$ of $A$ equals $\alpha = 2$. When a
factorization [14]

$$LA - AR = GB, \qquad G \in \mathbb{C}^{(n+m+1) \times \alpha}, \quad B \in \mathbb{C}^{\alpha \times (n+m+1)},$$

is known, then an LU factorization of the Cauchy-like matrix
$\hat A = A\,(Q_{1,m}^H \oplus Q_{1,n+1}^H)$ (the superscript $H$ denotes complex
conjugation and transposition), where the columns of the matrices $Q_{1,m}$ and
$Q_{1,n+1}$ contain the eigenvectors of $Z_{1,m}$ and $Z_{1,n+1}$, for which
explicit formulas are known, can be obtained from [14] with order of complexity
$O(2(n+m+1)^2)$.
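
The displacement structure can be checked numerically. The sketch below assumes the column ordering $f_i x_i^j$, $j = 1, \dots, m$, followed by $-x_i^j$, $j = 0, \dots, n$, for the matrix $A$ (the system (2) with $b_0 = 1$), and uses arbitrary sample points and $f = \exp$ as hypothetical data. It only forms $LA - AR$ densely to confirm that its rank is 2; it does not implement the fast LU factorization of [14].

```python
import numpy as np

def shift_matrix(k, w=1.0):
    """Z_{w,k}: ones on the subdiagonal and w in the upper right corner."""
    Z = np.zeros((k, k))
    Z[np.arange(1, k), np.arange(0, k - 1)] = 1.0
    Z[0, k - 1] = w
    return Z

def displacement(x, f, n, m):
    """Return LA - AR for the Cauchy-Vandermonde matrix A of system (2) with b_0 = 1."""
    N = n + m + 1
    V = np.vander(x, max(n, m) + 1, increasing=True)
    A = np.hstack([f[:, None] * V[:, 1:m + 1], -V[:, :n + 1]])
    L = np.diag(1.0 / x)
    R = np.zeros((N, N))
    R[:m, :m] = shift_matrix(m).T          # R^T = Z_{1,m} (+) Z_{1,n+1}
    R[m:, m:] = shift_matrix(n + 1).T
    return L @ A - A @ R

n, m = 3, 2
x = 0.3 + 0.17 * np.arange(n + m + 1)      # nonzero points with |x_i| != 1
f = np.exp(x)
Dm = displacement(x, f, n, m)
rank = np.linalg.matrix_rank(Dm, tol=1e-10)
```

Only the first column of each block (columns 1 and $m+1$ in the paper's numbering) is nonzero, with entries $f_i(1 - x_i^m)$ and $-(1/x_i - x_i^n)$.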



2 Multivariate Rational Interpolation 

Although the situation between one and more variables is substantially different, 
there is no loss in generality by describing the bivariate case instead of the general 
higher-dimensional case. Let the bivariate function $f(x,y)$ be given in the set
of points $\{(x_k, y_k) \mid 0 \le k \le n+m\}$ and let us assume that none of the
points $(x_k, y_k)$ coincide. Let $N$ and $D$ be two finite subsets of
$\mathbb{N}^2$ with which we associate the bivariate polynomials

$$p(x,y) = \sum_{(i,j) \in N} a_{ij} x^i y^j, \qquad \#N = n+1, \qquad N \text{ from "numerator"},$$

$$q(x,y) = \sum_{(i,j) \in D} b_{ij} x^i y^j, \qquad \#D = m+1, \qquad D \text{ from "denominator"}.$$

The multivariate rational interpolation problem consists in finding polynomials
$p(x,y)$ and $q(x,y)$ with $p(x,y)/q(x,y)$ irreducible such that

$$f(x_k, y_k) = \frac{p}{q}(x_k, y_k), \qquad k = 0, \dots, n+m.$$

In applications where adaptive sampling is used and data points are placed at 
optimally located positions, it is an exception rather than the rule that some 
data points have the same x- or y-coordinates. Hence techniques available for a 
grid-like set of data points, such as in [21, 5, 10, 22, 19, 4] cannot be used. In the 
sequel we shall deal with the more general and less-studied multivariate situation 
where the dataset is not necessarily grid-structured. We do however require that 
the sets N and D satisfy the inclusion property, which is not a serious restriction. 
The problem of interpolating the data by $p(x,y)/q(x,y)$ is reformulated as

$$f(x_k, y_k) \Big( \sum_{(i,j) \in D} b_{ij} x_k^i y_k^j \Big) - \sum_{(i,j) \in N} a_{ij} x_k^i y_k^j = 0, \qquad k = 0, \dots, n+m. \qquad (3)$$

The set of polynomial tuples (p,q) satisfying (3) is denoted by [N/D]^_^^. 
3 Fast Block Cauchy-Vandermonde Solver

Let us introduce the notations

$$I^{(N)} = \max\{i \mid (i,j) \in N\}, \qquad J^{(N)} = \max\{j \mid (i,j) \in N\},$$

$$I^{(D)} = \max\{i \mid (i,j) \in D\}, \qquad J^{(D)} = \max\{j \mid (i,j) \in D\}.$$

Let $\max(I^{(N)}, J^{(N)}, I^{(D)}, J^{(D)}) = J^{(N)}$ and let us introduce the
shorthand notations $\nu = I^{(N)}$ and $\delta = I^{(D)}$. Since both $N$ and $D$
satisfy the inclusion property, we can decompose the sets as follows:

$$N = N_0 \cup \cdots \cup N_\nu, \qquad N_i = \{(i,j) \mid 0 \le j \le n^{(i)}\}, \qquad n^{(i)} = \max\{j \mid (i,j) \in N\},$$

$$D = D_0 \cup \cdots \cup D_\delta, \qquad D_i = \{(i,j) \mid 0 \le j \le m^{(i)}\}, \qquad m^{(i)} = \max\{j \mid (i,j) \in D\}.$$

If $\max(I^{(N)}, J^{(N)}, I^{(D)}, J^{(D)}) = J^{(D)}$ we proceed analogously. If
$\max(I^{(N)}, J^{(N)}, I^{(D)}, J^{(D)})$ is attained in the $i$-direction instead
of in the $j$-direction, then the sets $N$ and $D$ are decomposed horizontally
instead of vertically. We point out that the sequel applies only if both sets
$N$ and $D$ are decomposed in the same way, either both horizontally or both
vertically.



Fig. 1. Breaking up N and D

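Using these notations, we arrange the unknown coefficients $a_{ij}$ and $b_{ij}$ as

$$B^T = \big( b_{00}, \dots, b_{0 m^{(0)}}, \; \dots, \; b_{\delta 0}, \dots, b_{\delta m^{(\delta)}} \big),$$

$$A^T = \big( a_{00}, \dots, a_{0 n^{(0)}}, \; \dots, \; a_{\nu 0}, \dots, a_{\nu n^{(\nu)}} \big).$$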
Introducing the matrices of size $(n+m+1) \times (m^{(i)}+1)$ and the matrices
of dimension $(n+m+1) \times (n^{(i)}+1)$, respectively given by




the system of interpolation conditions (3) looks like 




Let all $x_k y_k \ne 0$. Again, without loss of generality, we solve this system
with $b_{00}$ normalized to 1. We denote by $A$ the coefficient matrix of the
square inhomogeneous linear system, which results from removing the first unknown
from $B$ and the corresponding first column from the coefficient matrix. In order
to generalize the fast Cauchy-
Vandermonde solver, we choose a f 1 if any of the \yk\ is equal to one in case 
max(/(^), is either or or any of the \xk\ is equal to 

one in case is either or and introduce 



can be factored as 

LA- AR= (Gi Gz) B (4) 

where the (n -I- to -I- 1) x (i5 -I- 1) and {n + m + 1) x {v + 1) submatrices Gi and 
G 2 are given by 



/O . . . 0 1\ 



0 . . . 0 



Zk^ = 0 i; 



Vo... 0 . 0/,^, 



With 





it is easy to see that the resulting (n-|-TO-|-l) x (n-|-TO-|-l) matrix LA — AR 




and the matrix B consists of zeroes with the exception of the following entries 
which equal one: 
$\ell = 0$: column number $1$, row number $1$,
$\ell = 1, \dots, \delta$: column number $\sum_{i=1}^{\ell} (m^{(i-1)} + 1)$, row number $\ell + 1$,
$\ell = 0$: column number $m + 1$, row number $\delta + 2$,
$\ell = 1, \dots, \nu$: column number $(m+1) + \sum_{i=1}^{\ell} (n^{(i-1)} + 1)$, row number $(\delta + 1) + (\ell + 1)$.



When max(I^^\ is either or and N and D are being 

decomposed horizontally, then L is replaced by L = diag n+m- From 

the factorization (4) for LA — AR, and from the factorizations 

Zk^ = ^ Q?i.,kDi/.,kQi/.,k = V gf/,,, diag gi/„,fe 



where the eigenvalues are the k complex zeroes of vz'^ = 1 and the columns 
of the unitary matrix Qi/y^k £^re the eigenvectors of we obtain with 






0g 



H 

1 / v , m [^''+1 j ’ 



,i=0 






the factorization (the superscript ^ denotes transposition and complex conjuga- 
tion) 

L(AF^) - (AF^)vD^ = L(AF^) - ARF^ 

= {LA - AR)F^ 

= G{BF^), G=(GiG2) . (5) 

Since the matrix $AF^H$ is a Cauchy-like matrix, the technique proposed in [14]
can be applied. It incorporates partial pivoting, while its complexity of
$O((\delta + \nu + 2)(n+m+1)^2)$ is noticeably smaller than that of classical
Gaussian elimination. This is achieved because the technique exploits the
structure of the coefficient matrix $AF^H$ by constructing the LU factorization
of $AF^H$ from the knowledge of the matrix factors $G$ and $B$ in the
factorization (5).



4 Complexity, Stability and Reliability 

The procedure to compute the LU factorization of $AF^H$ directly from the matrix
factors $G$ and $B$ is detailed in [14]. Roughly speaking, all entries in the LU
factors can be computed from the scalar products of the rows in $G$ and the
columns in $B$, and the differences of the entries in the left and right
displacement operators. Having the LU decomposition of $AF^H$ at our disposal,
an approximate inverse $W$ of $A$ can be computed (remember that $FF^H = I$)
and the following interval arithmetic verification step can easily be performed.
We again assume for simplicity that the system of interpolation conditions (3)
can be solved with the choice $b_{00} = 1$. We denote the coefficient matrix
resulting from the choice $b_{00} = 1$ by $A$, the right-hand side of the square
inhomogeneous linear system by $c$, and the floating-point solution of
$(AF^H)Fx = c$ computed by means of the fast technique described in [14] by
$\tilde x$. The fixpoint of the iteration function

$$f(e) = W(c - A\tilde x) + (I - WA)e$$

is the defect vector $e = x - \tilde x$ where $x$ is the exact solution of
$Ax = c$ [17]. If $\mathcal{F}(E)$ denotes its interval extension and if for some
interval $E$,

$$\mathcal{F}(E) \subset \mathring{E}$$

where $\mathring{E}$ denotes the interior of the interval $E$, then the linear
system $Ax = c$ has one and only one solution in the interval $\tilde x + E$.
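
The inclusion test can be sketched in a few lines. The version below is only an illustration of the logic: it uses a naive midpoint-radius representation of $E$ in ordinary floating-point arithmetic, without the directed rounding that a rigorous interval library would provide, so it does not constitute a verified enclosure. The function name `verify_solution`, the candidate radius and the 2x2 test system are assumptions made for the example.

```python
import numpy as np

def verify_solution(A, c, x_tilde):
    """Naive (non-rigorous) check of F(E) within the interior of E, E = [-rad, rad]."""
    W = np.linalg.inv(A)                   # approximate inverse of A
    d = W @ (c - A @ x_tilde)              # W (c - A x~): image of e = 0
    C = np.eye(len(c)) - W @ A             # I - W A
    rad = 2.0 * np.abs(d) + 1e-20          # candidate interval radius (a heuristic choice)
    new_mid = d                            # midpoint of F(E), since E is centred at 0
    new_rad = np.abs(C) @ rad              # radius of (I - W A) E
    included = bool(np.all(np.abs(new_mid) + new_rad < rad))
    return included, rad

A = np.array([[4.0, 1.0], [2.0, 3.0]])
c = np.array([1.0, 2.0])
x_tilde = np.linalg.solve(A, c)
ok, rad = verify_solution(A, c, x_tilde)
```

When the check succeeds, `rad` plays the role of the width of $E$ that is reported as diam($E$) in Table 1.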

For classical Gaussian elimination with partial pivoting performed on the
full matrix $A$ instead of on the factors $G$ and $B$ of $AF^H$, the error in
$\tilde x$, say the width of $E$, is typically of the order of the product of the
condition number of $A$ and the machine epsilon, which is determined by the radix
$\beta$ and the precision $t$ of the floating-point system in use. In Table 1 we
illustrate that the fast Gaussian elimination with partial pivoting performed on
the factors $G$ and $B$ enjoys the same property, under the condition that (6) is
not too small. This is in fact an optimal result for a fast linear system solver.
In Table 1 the value diam($E$) with $E = (E_1, \dots, E_{n+m+1})$ is defined by

$$\mathrm{diam}(E) = \sqrt{\sum_{i=1}^{n+m+1} \mathrm{diam}(E_i)^2}$$



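Let us denote the matrix elements in the factors $G$ and $B$ of (5) by
$G = (\gamma_{ij})$ and $B = (\beta_{ij})$. According to [1], instabilities can
occur if the size of the matrix elements
$\big| \sum_{k=1}^{\delta+\nu+2} \gamma_{ik} \beta_{kj} \big|$ is small compared to
that of the elements $\sum_{k=1}^{\delta+\nu+2} |\gamma_{ik}| \cdot |\beta_{kj}|$.
Therefore, in Tables 3 and 4, the value

$$\min_{i,j} \frac{\big| \sum_{k=1}^{\delta+\nu+2} \gamma_{ik} \beta_{kj} \big|}{\sum_{k=1}^{\delta+\nu+2} |\gamma_{ik}| \cdot |\beta_{kj}|} \qquad (6)$$

is tabulated together with the evaluations of each scattered rational interpolant,
the norms of the residue $r$ and normalized residue $r_{norm}$, and the $\ell_2$
condition number $\kappa_2(A)$ of the square matrix $A$.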
Table 1. Conditioning and stability (IEEE standard double precision)

 k    dim(A)    κ2(A)      diam(E)
 1       6      1.4e+02    2.3e-13
 5      24      3.8e+03    9.6e-12
 9      58      1.9e+06    7.2e-09
13     105      1.2e+09    3.5e-06



Table 2. Exact values of the Beta function 



x\y -0.75 -0.25 +0.25 +0.75 

-0.75 +9.88839827894 +0.00000000000 +4.94419913947 +0.00000000000 
-0.25 +0.00000000000 -6.77770467835 +0.00000000000 -3.38885233918 
+0.25 +4.94419913947 +0.00000000000 +7.41629870921 +4.44288293816 
+0.75 +0.00000000000 -3.38885233918 +4.44288293816 +1.69442616959 



Table 3. Scattered interpolant

 k     (6)        ||r||_2    ||r_norm||_2   κ2(A)
11     1.0e+00    2.8e-14    2.7e-15        6.2e+08

x\y -0.75 -0.25 +0.25 +0.75

-0.75 +9.88685119987 +0.00224594375 +4.94385076491 +0.00009350063
-0.25 -0.00098402596 -6.77769745504 +0.00001290232 -3.38885282711
+0.25 +4.94390538464 -0.00000679165 +7.41629926989 +4.44288258156
+0.75 +0.00030912310 -3.38885313920 +4.44288082939 +1.69442607545


Table 4. Scattered interpolant [N_15/

 k     (6)        ||r||_2    ||r_norm||_2   κ2(A)
15     1.0e+00    2.3e-13    1.6e-14        3.2e+11

x\y -0.75 -0.25 +0.25 +0.75

-0.75 +9.88838464368 +0.00000158653 +4.94419901680 -0.00000021220
-0.25 +0.00000233463 -6.77770469003 -0.00000000786 -3.38885234732
+0.25 +4.94419930347 +0.00000000528 +7.41629870926 +4.44288293815
+0.75 -0.00000220603 -3.38885234095 +4.44288293796 +1.69442616960
As an illustration of the technique, we apply the structured Cauchy-Vandermonde
solver to the computation of some bivariate rational interpolants from
scattered data obtained from the Beta function

$$B(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$$

where $\Gamma$ denotes the Gamma function. The Beta function is an interesting
example because of its meromorphy. By means of the recurrence formulas

$$\Gamma(x+1) = x\Gamma(x), \qquad \Gamma(y+1) = y\Gamma(y),$$

for the Gamma function, we can write

$$B(x,y) = \frac{1 + (x-1)(y-1)\, f(x-1, y-1)}{xy}. \qquad (7)$$

If we approximate the function $f(x-1, y-1)$ by a rational interpolant
$[N/D](x,y)$ and plug this approximant into (7), we obtain a rational
approximant for $B(x,y)$. Because of the location of the poles of $B(x,y)$ at
$x = -k$ and $y = -k$ for $k = 1, 2, \dots$ and its zeroes at $x + y = -\ell$ for
$\ell = 0, 1, 2, \dots$, we choose

$$D = \{(0,0), (1,0), (0,1), (1,1)\},$$

$$N_k = \{(i,j) \mid 0 \le i + j \le k\}, \qquad k = 1, 2, \dots,$$

$$\#I_k = (k+1)(k+2)/2 + 3.$$

With this choice of index sets, $\max(I^{(N_k)}, J^{(N_k)}, I^{(D)}, J^{(D)}) = k$
and $\delta = 1$, $m + 1 = 4$, $\nu = k$, $n + 1 = (k+1)(k+2)/2$. For the
interpolation points indexed by $I_k$ we choose randomly generated tuples in the
domain $[-1,1] \times [-1,1]$. The displacement rank of $AF^H$ equals
$\delta + \nu + 2 = k + 3$. Since $n \approx \nu^2/2 \approx k^2/2$, the technique
described in Sect. 3 significantly reduces the complexity when $k$ is larger.

All rational interpolants are evaluated in the 16 points 



{-0.75,-0.25,0.25,0.75} X {-0.75,-0.25,0.25,0.75} . 



The exact value of the Beta function in these 16 points can be found in Table 2. 
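
The reference values in Table 2 can be reproduced directly from the definition $B(x,y) = \Gamma(x)\Gamma(y)/\Gamma(x+y)$. The sketch below uses Python's standard `math.gamma`; the helper name `beta` is hypothetical, and the zero returned at the poles of $\Gamma(x+y)$ reflects the zeroes of the Beta function on the lines $x + y = -\ell$ (it assumes that $x$ and $y$ themselves are not poles of $\Gamma$, which holds on this grid).

```python
import math

def beta(x, y):
    """Beta function via Gamma; zero where Gamma(x + y) has a pole."""
    s = x + y
    if s <= 0 and s == int(s):
        return 0.0                      # x + y = 0, -1, -2, ...: B(x, y) vanishes
    return math.gamma(x) * math.gamma(y) / math.gamma(s)

grid = [-0.75, -0.25, 0.25, 0.75]
table = [[beta(x, y) for y in grid] for x in grid]
for x, row in zip(grid, table):
    print(f"{x:+.2f} " + " ".join(f"{v:+.11f}" for v in row))
```

Comparing these values with the entries of Tables 3 and 4 gives the pointwise errors of the two scattered interpolants.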
Abstract. We consider an optimal distributed control problem involving
semilinear parabolic partial differential equations, with control and
state constraints. Since no convexity assumptions are made, the problem
is reformulated in relaxed form. The state equation is discretized
using a finite element method in space and a $\theta$-scheme in time, while
the controls are approximated by blockwise constant relaxed controls.
The first result is that, under appropriate assumptions, the properties 
of optimality, and of extremality and admissibility, carry over in the 
limit to the corresponding properties for the relaxed continuous prob- 
lem. We also propose progressively refining discrete conditional gradient 
and gradient-penalty methods, which generate relaxed controls, for solv- 
ing the continuous relaxed problem, thus reducing computations and 
memory. Numerical examples are given. 



1 Introduction 

Optimal control problems, without strong and often unrealistic convexity as- 
sumptions, have no classical solutions in general. For this reason they are refor- 
mulated in the so-called relaxed form, which in turn has a solution under mild 
assumptions. Relaxation theory has been extensively used to develop not only 
existence theory, but also necessary conditions for optimality and optimization 
methods and approximation methods (see [1-10], and references in [8]). Here we 
consider an optimal distributed control problem involving semilinear parabolic
PDEs, with state constraints. Since no convexity assumptions are made, the
problem is reformulated in relaxed form. The state equation is then discretized
using a finite element method in space and a $\theta$-method in time (extending
the method in [2] and [5]), while the controls are approximated by blockwise
constant relaxed controls. The first result is that, under appropriate assumptions,
the properties of optimality, and extremality and admissibility, carry over in the 
limit to the corresponding properties for the relaxed continuous problem (ex- 
tending the results in [2]). In addition, we propose progressively refining discrete 
conditional gradient and gradient-penalty methods, which generate relaxed con- 
trols, for solving the continuous relaxed problem (extending [5]). The refining 
procedure has the advantage of reducing computing time and memory. On the
other hand, the use of relaxed controls exploits the nonconvex structure of the 
problem. Finally, two numerical examples are given. 



2 The Continuous Optimal Control Problems 

Let $\Omega$ be a bounded domain in $\mathbb{R}^d$ with a Lipschitz boundary
$\Gamma$, and let $I = (0, T)$. Consider the following semilinear parabolic state
equation

$$y_t + A(t)y = f(x,t,y(x,t),w(x,t)) \ \text{ in } Q = \Omega \times I,$$
$$y(x,t) = 0 \ \text{ on } \Sigma = \Gamma \times I, \quad \text{and} \quad y(x,0) = y^0(x) \ \text{ in } \Omega,$$

where $A(t)$ is the second order elliptic differential operator

$$A(t)y = -\sum_{j=1}^{d} \sum_{i=1}^{d} (\partial/\partial x_i)\big[a_{ij}(x,t)(\partial y/\partial x_j)\big].$$

The constraints on the control variable $w$ are $w(x,t) \in U$ in $Q$, where $U$
is a compact, not necessarily convex, subset of $\mathbb{R}^{d'}$. The constraints
on the state and the control variables are

$$J_m(w) = \int_Q g_m(x,t,y,w)\,dx\,dt = 0, \qquad 1 \le m \le p,$$
$$J_m(w) = \int_Q g_m(x,t,y,w)\,dx\,dt \le 0, \qquad p < m \le q,$$

and the cost functional to be minimized is

$$J_0(w) = \int_Q g_0(x,t,y,w)\,dx\,dt.$$
Define the set of classical controls

$$W = \{w : (x,t) \mapsto w(x,t) \mid w \text{ measurable from } Q \text{ to } U\},$$

and the set of relaxed controls (Young measures; for the relevant theory, see
[8,9]) by

$$R := \{r : Q \to M_1(U) \mid r \text{ weakly measurable}\} \subset L_w^\infty(Q, M(U)) = L^1(Q, C(U))^*,$$

where $M_1(U)$ is the set of probability measures on $U$. The set $W$ (resp. $R$)
is endowed with the relative strong (resp. weak star) topology, and $R$ is convex,
metrizable and compact. If we identify each classical control $w(\cdot)$ with its
associated Dirac relaxed control $r(\cdot) := \delta_{w(\cdot)}$, then $W$ may be
considered as a subset of $R$, and $W$ is thus dense in $R$. For given
$\phi \in L^1(Q, C(U))$ and $r \in R$, we write for simplicity

$$\phi(x,t,r(x,t)) = \int_U \phi(x,t,u)\, r(x,t)(du).$$

The relaxed formulation of the above control problem is the following:

$$\langle y_t, v \rangle + a(t,y,v) = \int_\Omega f(x,t,y(x,t),r(x,t))\, v(x)\,dx, \quad \text{for every } v \in V, \text{ a.e. in } I,$$
$$y(x,t) = 0 \ \text{ on } \Sigma = \Gamma \times I \quad \text{and} \quad y(x,0) = y^0(x) \ \text{ in } \Omega,$$

where $V = H_0^1(\Omega)$, $a(t,\cdot,\cdot)$ is the usual bilinear form associated
with $A(t)$, and $\langle \cdot,\cdot \rangle$ is the duality between $V$ and
$V^*$, with control constraint $r \in R$, state constraints

$$J_m(r) = \int_Q g_m(x,t,y,r(x,t))\,dx\,dt = 0, \qquad 1 \le m \le p,$$
$$J_m(r) = \int_Q g_m(x,t,y,r(x,t))\,dx\,dt \le 0, \qquad p < m \le q,$$

and cost to be minimized

$$J_0(r) = \int_Q g_0(x,t,y(x,t),r(x,t))\,dx\,dt.$$

We suppose in the sequel that $f$ is Lipschitz and sublinear w.r.t. $y$, the
functions $g_m$ are subquadratic w.r.t. $y$, and $g_{my}$ is sublinear w.r.t. $y$.

Theorem 1. The mappings $r \mapsto y \in L^2(Q)$ and $r \mapsto J_m(r)$ are continuous.

Theorem 2. If the relaxed problem is feasible, then it has a solution.

Since $W \subset R$, we have in general

$$\min_{r \in R} J_0(r) \le \inf_{w \in W} J_0(w),$$

and if there are no state constraints, since $W$ is dense in $R$,

$$\min_{r \in R} J_0(r) = \inf_{w \in W} J_0(w).$$

It can be shown (see [2]) that, for given controls $r, r' \in R$, the directional
derivative of the functional $J$ (we drop the index $m$) is given by

$$DJ(r, r' - r) = \int_Q H(x,t,y,z,r' - r)\,dx\,dt,$$

where the Hamiltonian is

$$H(x,t,y,z,u) = z f(x,t,y,u) + g(x,t,y,u),$$

and the general adjoint $z$ satisfies the equation

$$-z_t + A^*(t)z = f_y(y,r)z + g_y(y,r) \ \text{ in } Q,$$
$$z(x,t) = 0 \ \text{ on } \Sigma = \Gamma \times I, \quad \text{and} \quad z(x,T) = 0 \ \text{ in } \Omega.$$

We have the following relaxed necessary conditions for optimality. 

Theorem 3. If $r$ is optimal, then it is extremal, i.e. there exist multipliers
$\lambda_0 \ge 0$, $\lambda_m \in \mathbb{R}$, $1 \le m \le p$, $\lambda_m \ge 0$,
$p < m \le q$, not all zero, such that

$$\sum_{m=0}^{q} \lambda_m DJ_m(r, r' - r) \ge 0, \quad \text{for every } r' \in R,$$

$$\lambda_m J_m(r) = 0, \quad p < m \le q \quad \text{(Transversality conditions)},$$

where $g$ is replaced by $\sum_{m=0}^{q} \lambda_m g_m$ in the definitions of $z$, $H$,
or equivalently

$$H(x,t,y,z,r) = \min_{u \in U} H(x,t,y,z,u), \quad \text{a.e. in } Q \quad \text{(Minimum principle)},$$

$$\lambda_m J_m(r) = 0, \quad p < m \le q.$$

Theorem 4. The mappings $r \mapsto z \in L^2(Q)$ and $(r,r') \mapsto DJ(r, r' - r)$ are
continuous.

3 Discretization 

The set $\Omega$ is supposed here to be a polyhedron, for simplicity. For each
$n \ge 0$, we choose a finite element discretization, with piecewise affine basis
functions of a subspace $V^n$ of $V$, w.r.t. an admissible and regular
triangulation $T^n$ of $\Omega$ into simplices $S_i^n$, $i = 1,\dots,M^n$, of maximum
diameter $h^n$, and a time partition $P^n$ of $I$ into equal subintervals $I_j^n$,
$j = 0,\dots,N-1$, of length $\Delta t^n$. The set $R^n$ of discrete relaxed controls
is the set of relaxed controls that are equal to a constant measure in $M_1(U)$ on
each block $S_i^n \times I_j^n$. Consider the following discrete problems, with
state equation defined by the implicit $\theta$-scheme ($1/2 \le \theta \le 1$)

$$(1/\Delta t^n)(y_{j+1}^n - y_j^n, v) + a(y_{j\theta}^n, v) = (f(t_{j\theta}^n, y_{j\theta}^n, r_j^n), v), \quad \text{for every } v \in V^n,$$

$$j = 0,\dots,N-1, \quad y_0^n \text{ given},$$

$$t_{j\theta}^n = (1-\theta)\, t_j^n + \theta\, t_{j+1}^n, \qquad y_{j\theta}^n = (1-\theta)\, y_j^n + \theta\, y_{j+1}^n,$$

control constraint $r^n \in R^n$, discrete functionals

$$J_m^n(r^n) = \Delta t^n \sum_{j=0}^{N-1} \int_\Omega g_m(t_{j\theta}^n, y_{j\theta}^n, r_j^n)\, dx, \quad 0 \le m \le q,$$

either of the two cases of perturbed discrete equality constraints

case (i) $|J_m^n(r^n)| \le \varepsilon_m^n$, $1 \le m \le p$, or case (ii) $J_m^n(r^n) = \varepsilon_m^n$, $1 \le m \le p$,

perturbed discrete inequality constraints

$$J_m^n(r^n) \le \varepsilon_m^n, \quad p < m \le q,$$

where $\varepsilon_m^n \ge 0$, $1 \le m \le q$, and cost functional to be minimized
$J_0^n(r^n)$.
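The time stepping in the $\theta$-scheme can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not taken from the paper: a linear one-dimensional heat equation $y_t - y_{xx} = s(x,t)$ on $(0,\pi)$ with homogeneous Dirichlet conditions, piecewise affine finite elements on a uniform mesh, a lumped load vector, and no control or nonlinearity. The function names `solve_tridiagonal` and `theta_scheme_heat` are hypothetical.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm. lower[i] multiplies x[i-1] and upper[i] multiplies x[i+1] in row i."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def theta_scheme_heat(m, n_steps, t_end, theta, y0, source):
    """Advance (M/dt + theta*K) y_{j+1} = (M/dt - (1-theta)*K) y_j + F(t_{j,theta})."""
    h = math.pi / m
    dt = t_end / n_steps
    n = m - 1                                   # number of interior nodes
    xs = [i * h for i in range(1, m)]
    # P1 mass matrix: (h/6) tridiag(1, 4, 1); stiffness matrix: (1/h) tridiag(-1, 2, -1).
    a_diag = 4.0 * h / (6.0 * dt) + theta * 2.0 / h
    a_off = h / (6.0 * dt) - theta / h
    b_diag = 4.0 * h / (6.0 * dt) - (1.0 - theta) * 2.0 / h
    b_off = h / (6.0 * dt) + (1.0 - theta) / h
    y = [y0(x) for x in xs]
    for j in range(n_steps):
        t_theta = (j + theta) * dt              # (1 - theta) * t_j + theta * t_{j+1}
        rhs = []
        for i in range(n):
            left = y[i - 1] if i > 0 else 0.0
            right = y[i + 1] if i < n - 1 else 0.0
            load = h * source(xs[i], t_theta)   # lumped approximation of (s, v_i)
            rhs.append(b_off * left + b_diag * y[i] + b_off * right + load)
        y = solve_tridiagonal([a_off] * n, [a_diag] * n, [a_off] * n, rhs)
    return xs, y


# Homogeneous problem with exact solution exp(-t) * sin(x).
xs, y = theta_scheme_heat(60, 40, 1.0, 0.52, math.sin, lambda x, t: 0.0)
err_decay = max(abs(yi - math.exp(-1.0) * math.sin(x)) for x, yi in zip(xs, y))

# Forced problem with exact solution (1 - exp(-t)) * sin(x).
xs2, y2 = theta_scheme_heat(60, 40, 1.0, 0.52, lambda x: 0.0, lambda x, t: math.sin(x))
err_forced = max(abs(yi - (1.0 - math.exp(-1.0)) * math.sin(x)) for x, yi in zip(xs2, y2))
```

The mesh and step sizes in the usage lines mirror the orders of magnitude reported in the numerical examples; in the actual method the right-hand side depends on the state and on the relaxed control, so each step requires the solution of a nonlinear system.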

Theorem 5. The mappings $r^n \mapsto y^n$ and $r^n \mapsto J_m^n(r^n)$ are continuous.

Theorem 6. If the discrete problem (case (i) or (ii)) is feasible, then it has a 
solution. 



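Dropping $m$, the general discrete adjoint equation is here

$$-(1/\Delta t^n)(z_{j+1}^n - z_j^n, v) + a(v, z_{j,1-\theta}^n) = (f_y(t_{j\theta}^n, y_{j\theta}^n, r_j^n)\, z_{j,1-\theta}^n + g_y(t_{j\theta}^n, y_{j\theta}^n, r_j^n), v),$$

for every $v \in V^n$, $z_N^n = 0$,
and the directional derivative of $J^n$

$$DJ^n(r^n, r'^n - r^n) = \Delta t^n \sum_{j=0}^{N-1} \int_\Omega H(x, t_{j\theta}^n, y_{j\theta}^n, z_{j,1-\theta}^n, r_j'^n - r_j^n)\, dx.$$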

Theorem 7. (Constraint case (ii)). If $r^n$ is optimal, then it is extremal, i.e.
there exist multipliers, similar to the continuous case, such that

i -r”) > 0, for every r'" G R, A" J"(r") = 0, p < 

m—0 

m < q, 

9 

where g is replaced by ^ ^m9m in z, H , or equivalently 

m—0 

J^H{x,tO,yO,zO_g,rf)dx = min H{x,f^ ,y^, z(^_g,u)dx, j = 0,...,N-1, 



\mJmi.r) = 0, p <m < q. 



Theorem 8. The mappings $r^n \mapsto z^n$ and $(r^n, r'^n) \mapsto DJ^n(r^n, r'^n - r^n)$
are continuous.



4 Behavior in the Limit 

Theorem 9. (Control approximation) For each $r \in R$, there exists a sequence
$(r^n \in R^n)$ that converges to $r$ in $R$ (the controls $r^n$ can even be chosen
to be blockwise constant classical ones).

We suppose that $y_0^n \to y^0$ in $V$ strongly. In addition, if $\theta = 1/2$, we
suppose that $\Delta t^n \le c\,(h^n)^2$, for an appropriate $c$.

Theorem 10. (Consistency) If $r^n \to r$, $r'^n \to r'$ in $R$, then

$$y^n \to y, \quad z^n \to z, \quad J_m^n(r^n) \to J_m(r), \quad DJ_m^n(r^n, r'^n - r^n) \to DJ_m(r, r' - r).$$

The two following results concern the behavior in the limit of the properties
of optimality, and of extremality and admissibility.

Theorem 11. We suppose that the continuous relaxed problem is feasible. Let
$\tilde r$ be some optimal control for the continuous relaxed problem, and
$(\tilde r^n)$ any sequence converging to $\tilde r$. Let $(r^n)$ be a sequence of
optimal controls for the discrete problems (constraint case (i)), where the
$\varepsilon_m^n$ are such that $\varepsilon_m^n \to 0$, as $n \to \infty$, and satisfy
the admissibility conditions

$$|J_m^n(\tilde r^n)| \le \varepsilon_m^n, \ 1 \le m \le p, \quad \text{and} \quad J_m^n(\tilde r^n) \le \varepsilon_m^n, \ p < m \le q.$$

If $\Delta t^n$ is sufficiently small, then every accumulation point of $(r^n)$ is
optimal for the continuous relaxed problem.

Theorem 12. Let $(r^n)$ be a sequence of extremal and admissible controls for
the discrete problems (constraint case (ii)), where $\varepsilon_m^n \to 0$ as
$n \to \infty$, and the perturbations $\varepsilon_m^n$ satisfy a (constructive)
minimum feasibility condition (see [4]). If $\Delta t^n$ is sufficiently small, then
every accumulation point of $(r^n)$ is extremal and admissible for the continuous
relaxed problem.



5 Discrete Optimization Methods 

Using the above Theorem 12, one can directly apply some optimization method
to the discrete problem (case (ii)), for some fixed, sufficiently large, $n$. Here,
we propose relaxed discrete, progressively refining, conditional gradient and
conditional gradient-penalty methods, where the discretization is refined
according to some convergence criterion. This refining procedure has the
advantage of reducing computations and memory and completely avoiding the
consideration of discrete problems and the computation of their perturbations
$\varepsilon_m^n$.

We suppose that, for every $n$, either $T^{n+1} = T^n$ and $P^{n+1} = P^n$, or each
simplex $S_{i'}^{n+1}$ is a subset of some $S_i^n$ and each interval $I_{j'}^{n+1}$ a
subset of some $I_j^n$. Let $(M_m^n)$, $1 \le m \le q$, be nonnegative increasing
sequences, such that $M_m^n \to \infty$ as $n \to \infty$, and define the penalized
discrete functionals (we use here the common index $n$ for discretization and
penalization)

$$J^n(r) = J_0^n(r) + \frac{1}{2} \Big\{ \sum_{m=1}^{p} M_m^n [J_m^n(r)]^2 + \sum_{m=p+1}^{q} M_m^n [\max(0, J_m^n(r))]^2 \Big\}.$$

Let $b, c \in (0,1)$, $(\gamma^n)$, $(\zeta^n)$, with $\zeta^n \le 1$, positive decreasing
sequences converging to zero, and consider the following algorithm.

Algorithm: Relaxed conditional gradient / gradient-penalty method

Step 1. Set $n = 0$, $k = 0$ and choose $r_0^0 \in R^0$.

Step 2. Find $\bar r_k^n \in R^n$ such that

$$d_k = DJ^n(r_k^n, \bar r_k^n - r_k^n) = \min_{r'^n \in R^n} DJ^n(r_k^n, r'^n - r_k^n).$$

Step 3. If $|d_k| \le \gamma^n$, set $r^n = r_k^n$, $\bar r^n = \bar r_k^n$, $d^n = d_k$,
$n = n + 1$ and go to Step 2.

Step 4. Either

(a) (Optimal step option) Find $\alpha_k \in [0,1]$ such that

$$J^n(r_k^n + \alpha_k(\bar r_k^n - r_k^n)) = \min_{\alpha \in [0,1]} J^n(r_k^n + \alpha(\bar r_k^n - r_k^n)),$$

or

(b) (Armijo step option) Find the smallest positive integer $s$ such that
$\alpha_k = \zeta^n c^s$ satisfies the inequality

$$J^n(r_k^n + \alpha_k(\bar r_k^n - r_k^n)) - J^n(r_k^n) \le \alpha_k b\, d_k.$$

Step 5. Choose any $r_k'^n \in R^n$ such that

$$J^n(r_k'^n) \le J^n(r_k^n + \alpha_k(\bar r_k^n - r_k^n)).$$

Set $r_{k+1}^n = r_k'^n$, $k = k + 1$ and go to Step 2.
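The core of Steps 2 to 5 (with the Armijo option) can be sketched on a toy problem. The sketch below is not the method of the paper: it assumes a single block, a finite control set, no state equation, no refinement in $n$ and no penalties. The relaxed control is a weight vector on the points of $U$, and the cost $J(r) = \frac12 (\int u\, r(du) - \bar w)^2 - \int u^2\, r(du)$ is a hypothetical stand-in chosen to resemble the nonconvex example of Section 6. All names are illustrative.

```python
def relaxed_conditional_gradient(points, target, b=0.5, c=0.5, zeta=1.0,
                                 gamma=1e-2, max_iter=100_000):
    """Conditional gradient with Armijo steps over probability weights on `points`."""
    n = len(points)
    r = [1.0 / n] * n                       # initial relaxed control: uniform weights

    def cost(w):
        mean = sum(wi * u for wi, u in zip(w, points))
        second = sum(wi * u * u for wi, u in zip(w, points))
        return 0.5 * (mean - target) ** 2 - second

    d = 0.0
    for _ in range(max_iter):
        mean = sum(wi * u for wi, u in zip(r, points))
        # ham[i] is the derivative of the cost w.r.t. the weight of points[i];
        # it plays the role of the Hamiltonian evaluated at u = points[i].
        ham = [(mean - target) * u - u * u for u in points]
        i_min = min(range(n), key=ham.__getitem__)
        # Step 2: the minimizing direction is a Dirac measure at the best point.
        d = ham[i_min] - sum(wi * hi for wi, hi in zip(r, ham))
        if abs(d) <= gamma:                 # Step 3 (here: simply stop)
            break
        r_bar = [1.0 if i == i_min else 0.0 for i in range(n)]
        j_old = cost(r)
        s = 1                               # Step 4 (b): smallest positive integer s
        while True:
            alpha = zeta * c ** s
            trial = [wi + alpha * (bi - wi) for wi, bi in zip(r, r_bar)]
            if cost(trial) - j_old <= alpha * b * d:
                break
            s += 1
        r = trial                           # Step 5 with r' equal to the trial point
    return r, cost(r), d


U_POINTS = [-1.0, -0.5, 0.0, 1.0]
weights, value, d_final = relaxed_conditional_gradient(U_POINTS, target=0.5)
mean_final = sum(w * u for w, u in zip(weights, U_POINTS))
```

For this toy cost the minimum value is $-1$, attained by measures concentrated on $\{-1, 1\}$ with mean $0.5$, which no single point of the control set achieves. Since the toy cost is convex in the weights, the stopping quantity $|d_k|$ also bounds the distance of the cost from its minimum.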

The Armijo step option is a finite procedure and is practically faster than
the optimal step option (which uses golden section search). If there are no state
constraints, we set $g_m = 0$, $1 \le m \le q$, and the above method reduces to a
relaxed conditional gradient method. In the state constrained case (conditional
gradient-penalty method), using the sequence $(r^n)$ generated in Step 3 of the
algorithm, we define the sequences of multipliers

$$\lambda_m^n = M_m^n J_m^n(r^n), \quad 1 \le m \le p,$$

$$\lambda_m^n = M_m^n \max(0, J_m^n(r^n)), \quad p < m \le q.$$

Theorem 13. (A) (Conditional Gradient Method) Every accumulation point
of the sequence $(r^n)$ generated by the algorithm in Step 3 is extremal for the
continuous relaxed problem.

(B) (Conditional Gradient-Penalty Method) Let $(r^n)_{n \in K}$ be a subsequence of
the sequence $(r^n)$ generated by the algorithm in Step 3, which converges to some
control $r \in R$.

(i) If the sequences $(\lambda_m^n)_{n \in K}$, $1 \le m \le q$, are bounded, then $r$ is
admissible and extremal for the continuous relaxed problem.

(ii) Suppose that the continuous relaxed problem has no admissible abnormal
extremal controls (see [9]). If $r$ is admissible, then the sequences
$(\lambda_m^n)_{n \in K}$ are bounded and $r$ is extremal for the continuous relaxed
problem.

Proof. (For the optimal step option) Suppose that $n$ remains constant after some
iterations. Then the penalty factors $M_m^n$ remain also constant. By construction,
$d_k \le 0$. Let $r_k^n \to \tilde r^n$ and $\bar r_k^n \to \bar r^n$ be convergent
subsequences ($k \in L$), and suppose that

$$d := \lim_k d_k = \lim_k DJ^n(r_k^n, \bar r_k^n - r_k^n) = DJ^n(\tilde r^n, \bar r^n - \tilde r^n) < 0.$$

Using the Mean Value Theorem, we have

$$J^n(r_k^n + \alpha(\bar r_k^n - r_k^n)) - J^n(r_k^n) = \alpha\, DJ^n(r_k^n + \mu\alpha(\bar r_k^n - r_k^n), \bar r_k^n - r_k^n) = \alpha(d + \varepsilon_{k\alpha}),$$

for $\alpha \in [0,1]$, where $\varepsilon_{k\alpha} \to 0$ as $k \to \infty$ and
$\alpha \to 0$. Hence

$$J^n(r_k^n + \alpha(\bar r_k^n - r_k^n)) - J^n(r_k^n) \le \alpha d/2,$$

for $\alpha \in [0,\delta]$ ($\delta > 0$), $k \ge k_0$, $k \in L$. Hence, by Step 4 (a),
we have

$$J^n(r_k^n + \alpha_k(\bar r_k^n - r_k^n)) - J^n(r_k^n) = \min_{\alpha \in [0,1]} [J^n(r_k^n + \alpha(\bar r_k^n - r_k^n)) - J^n(r_k^n)] \le \delta d/2 < 0,$$

for $k \ge k_0$, $k \in L$. This implies that $J^n(r_k^n) \to -\infty$, $k \in L$, which
contradicts the convergence $J^n(r_k^n) \to J^n(\tilde r^n)$, $k \in L$, due to the
continuity of $J^n$ (Theorem 5). Therefore $d := \lim d_k = 0$ and we must have
$n \to \infty$.

Now, let any $r' \in R$, and $r'^n \to r'$, $n \in K$. By Steps 2 and 4

$$DJ^n(r^n, r'^n - r^n) = DJ_0^n(r^n, r'^n - r^n) + \sum_{m=1}^{q} \lambda_m^n DJ_m^n(r^n, r'^n - r^n) \ge d^n. \qquad (1)$$

If each sequence $(\lambda_m^n)$ is bounded (and we can suppose that
$\lambda_m^n \to \lambda_m$), we have

$$J_m(r) = \lim_n J_m^n(r^n) = \lim_n \lambda_m^n / M_m^n = 0, \quad 1 \le m \le p,$$

$$\max(0, J_m(r)) = \lim_n \max(0, J_m^n(r^n)) = \lim_n \lambda_m^n / M_m^n = 0, \quad p+1 \le m \le q,$$

i.e. $r$ is admissible. Passing to the limit in inequality (1), we obtain (Theorem
10)

$$DJ_0(r, r' - r) + \sum_{m=1}^{q} \lambda_m DJ_m(r, r' - r) \ge 0, \quad \text{for every } r' \in R.$$

If $J_m(r) < 0$ for some $p < m \le q$, then clearly $\lambda_m^n = 0$ for large $n$,
hence $\lambda_m = 0$. Therefore, $r$ is extremal.

If $r$ is admissible, suppose that $\lambda_m^n \to \infty$ for some $m$. Dividing
inequality (1) by $\max_m |\lambda_m^n|$, setting
$\mu_m^n := \lambda_m^n / \max_m |\lambda_m^n|$, and passing to the limit in (1), we
find

$$\sum_{m=1}^{q} \mu_m DJ_m(r, r' - r) \ge 0, \quad \text{for every } r' \in R,$$

and we easily see that $r$ is abnormal extremal, which contradicts our assumption.

□



For the implementation of algorithms generating relaxed controls, see [5]. Ap- 
proximate discrete relaxed controls computed by the above algorithm can then 
be simulated by piecewise constant classical controls using a simple approxima- 
tion procedure (see [2]). 



6 Numerical Examples 

a) Let $\Omega = (0,\pi)$, $I = (0,1)$, $Q = \Omega \times I$, and define the state

$$\bar y(x,t) = -\sin x + x(\pi - x)/2,$$

and the control

$$\bar w(x,t) = 1/2 + t, \ t \in [0,1/2), \qquad \bar w(x,t) = 1, \ t \in [1/2,1].$$
Consider the following optimal control problem, with state equation 
Vt - Vxx = ^ + w - w + y - y, (x,t) G Q, 

$$y(0,t) = y(\pi,t) = 0, \ t \in I, \qquad y(x,0) = -\sin x + x(\pi - x)/2, \ x \in \Omega,$$

(nonconvex) control constraints

$$w(x,t) \in U := [-1, 0] \cup \{1\}, \quad (x,t) \in Q,$$

and (nonconvex) cost functional

$$J_0(w) = \int_Q \big[ \tfrac{1}{2}(y - \bar y)^2 - w^2 \big]\, dx\, dt.$$

It can be verified that the optimal relaxed control is given by

$$r^*(x,t)(\{1\}) = 1/2 + t, \quad r^*(x,t)(\{-1\}) = 1 - r^*(x,t)(\{1\}), \quad t \in [0,1/2) \quad \text{(non-classical part)},$$

$$r^*(x,t)(\{1\}) = 1, \quad t \in [1/2,1] \quad \text{(classical part)},$$

with optimal state $y^* = \bar y$, and optimal cost $J_0(r^*) = -\pi$. Since the
optimal relaxed cost value $-\pi$ cannot be attained by classical controls due to
the constraints, but can be approximated as closely as desired since $W$ is dense
in $R$, the classical problem cannot have a solution.
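The stated optimal cost can be checked by a short computation. The sketch below assumes that the relaxed cost is obtained by integrating the integrand against $r^*(x,t)$, as in the notation of Section 2, and uses only the facts that $y^* = \bar y$, that $r^*$ is concentrated on $\{-1, 1\}$, and that $Q$ has measure $\pi$.

```latex
\begin{aligned}
J_0(r^*) &= \int_Q \Big[ \tfrac{1}{2}\,(y^* - \bar y)^2
            - \int_U u^2 \, r^*(x,t)(du) \Big] \, dx \, dt \\
         &= \int_Q \Big[ 0 - \big( 1^2 \cdot r^*(x,t)(\{1\})
            + (-1)^2 \cdot r^*(x,t)(\{-1\}) \big) \Big] \, dx \, dt \\
         &= - \int_0^1 \! \int_0^\pi 1 \, dx \, dt \;=\; -\pi .
\end{aligned}
```

The value $-\pi \approx -3.141593$ is consistent with the computed cost reported for this example.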

After 100 iterations of the relaxed conditional gradient method, with optimal
step option, $\theta = 0.52$ and with three successive pairs of step sizes
$h = \pi/30, \pi/60, \pi/120$, $\Delta t = 1/10, 1/20, 1/40$, we obtained
Jo(r") = -3.141574, |d”| = 0.602 • 10-^. 

b) Modifying the main state equation to

$$y_t - y_{xx} = w + y - \bar y,$$

and imposing the additional equality state constraint

$$J_1(w) = \int_Q y(x,t)\, dx\, dt = 0,$$

the relaxed conditional gradient-penalty method with optimal step yielded, after
100 iterations in total (i.e. in $k$), the results

Jo(r") = -2.799440, Ji(r’") = 0.464 • lO-^, |d"| = 0.425 • lO-^. 

References 

1. Chryssoverghi, I., Bacopoulos, A.: Discrete approximation of relaxed optimal
control problems. J. Optim. Theory and Appl., 65 (1990) 395-407

2. Chryssoverghi, I., Bacopoulos, A.: Approximation of relaxed nonlinear parabolic
optimal control problems. J. Optim. Theory and Appl., 77 (1993) 31-50

3. Chryssoverghi, I., Bacopoulos, A., Kokkinis, B., Coletsos, J.: Mixed Frank-Wolfe
penalty method with applications to nonconvex optimal control problems.
J. Optim. Theory and Appl., 94 (1997) 311-334

4. Chryssoverghi, I., Bacopoulos, A., Coletsos, J., Kokkinis, B.: Discrete
approximation of nonconvex hyperbolic optimal control problems with state
constraints. Control & Cybernetics, 27 (1998) 29-50

5. Chryssoverghi, I., Coletsos, J., Kokkinis, B.: Discrete relaxed method for
semilinear parabolic optimal control problems. Control & Cybernetics, 28 (1999)
157-176

6. Chryssoverghi, I., Coletsos, J., Kokkinis, B.: Approximate relaxed descent method
for optimal control problems. Control & Cybernetics, 30 (2001) 385-404

7. Roubicek, T.: A convergent computational method for constrained optimal relaxed
control problems. J. Optim. Theory and Appl., 69 (1991) 589-603

8. Roubicek, T.: Relaxation in Optimization Theory and Variational Calculus. Walter
de Gruyter, Berlin (1997)

9. Warga, J.: Optimal Control of Differential and Functional Equations. Academic
Press, New York (1972)

10. Warga, J.: Steepest descent with relaxed controls. SIAM J. on Control, 15 (1977)
674-682



Stabilizing Feedback of a Nonlinear Biological 
Wastewater Treatment Plants Model 



Neli Dimitrova and Mikhail Krastanov 



Institute of Mathematics and Informatics, Bulgarian Academy of Sciences 
Acad. G. Bonchev str. bl. 8, 1113 Sofia 
nelid@bio.bas.bg, krast@math.bas.bg



Abstract. The present paper deals with a nonlinear anaerobic digester 
model of wastewater treatment plants. The asymptotic stabilizability of 
this control system is studied under suitable assumptions and a smooth 
stabilizing feedback is proposed. 



1 Introduction 



The last decades have been marked by a rapid interaction between mathematics
and biology, forcing the implementation of new mathematical methods in
biological research. Typically, problems arising in biological applications involve
uncertain data in the form of intervals [5,6]. The development and implementation
of the new methodology for the treatment of problems whose data are “unknown but
bounded” is one of S. Markov’s many contributions to modern mathematical
biology. The authors gratefully acknowledge S. Markov’s generous support in
their efforts to promote advanced methods in biological problems.

In this paper we consider an anaerobic digester model of biological wastewater
treatment plants (WWTPs), described by the following control system [3]

$$\frac{ds_1}{dt} = u(s_1^i - s_1) - k_1 \mu_1 x_1 \qquad (1)$$

$$\frac{ds_2}{dt} = u(s_2^i - s_2) + k_2 \mu_1 x_1 - k_3 \mu_2 x_2 \qquad (2)$$

$$\frac{dx_1}{dt} = (\mu_1 - \alpha u)\, x_1 \qquad (3)$$

$$\frac{dx_2}{dt} = (\mu_2 - \alpha u)\, x_2 \qquad (4)$$

$$\frac{dc}{dt} = u(c^i - c) + k_4 \mu_1 x_1 + k_5 \mu_2 x_2 - Q \qquad (5)$$

$$\frac{dz}{dt} = u(z^i - z), \qquad (6)$$

where

$$\mu_1 = \mu_1(s_1) = \frac{\mu_{\max} s_1}{k_{s_1} + s_1}, \qquad \mu_2 = \mu_2(s_2) = \frac{\mu_0 s_2}{k_{s_2} + s_2 + (s_2 / k_I)^2}.$$

Table 1. Model variables and parameters

Symbol | Definition | Value | Units
s_1 | concentration of chemical oxygen demand (COD) | - | g/l
s_2 | concentration of volatile fatty acids (VFA) | - | mmol/l
x_1 | concentration of acidogenic bacteria | - | g/l
x_2 | concentration of methanogenic bacteria | - | g/l
c | total inorganic carbon concentration | - | mmol/l
z | strong ions concentration in the medium | - | mmol/l
u | dilution rate | - | day^-1
s_1^i | influent concentration of s_1 | 7 | g/l
s_2^i | influent concentration of s_2 | 70 | mmol/l
c^i | influent concentration of c | 65 | mmol/l
z^i | influent concentration of z | 67 | mmol/l
Q | gaseous CO2 molar flow rate | 20 | l/h
k_1 | yield coefficient for COD degradation | 10.53 | g COD/g x_1
k_2 | yield coefficient for VFA production | 28.6 | mmol VFA/g x_1
k_3 | yield coefficient for VFA consumption | 1074 | mmol VFA/g x_2
k_4 | yield coefficient for CO2 production due to x_1 | 12.42 | mmol CO2/g x_1
k_5 | yield coefficient for CO2 production due to x_2 | 1375 | mmol CO2/g x_2
mu_max | maximum acidogenic biomass growth rate | 1.2 | day^-1
mu_0 | maximum methanogenic biomass growth rate | 0.74 | day^-1
k_s1 | saturation parameter associated with s_1 | 7.1 | g COD/l
k_s2 | saturation parameter associated with s_2 | 9.28 | mmol VFA/l
k_I | inhibition constant associated with s_2 | 16 | (mmol VFA/l)^(1/2)
alpha | proportion of dilution rate reflecting process heterogeneity | - | -



The definitions of the model variables and parameters are listed in Table 1. It
is assumed that the dilution rate $u$ is a control input in (1)-(6). The numerical
values for $s_2^i$, $c^i$ and $Q$ are taken from the graphical outputs in [3].

This model has been experimentally validated for an anaerobic up-flow fixed
bed reactor used for the treatment of industrial wine distillery wastewater. More
details can be found in [3] and in the references therein.

The paper is organized as follows. Section 2 presents steady state analysis of 
the system. Some considerations regarding the linearized system are presented 
in Section 3. A continuous feedback stabilizing asymptotically the dynamics is 
proposed in Section 4. Computer simulations are reported in Section 5. 



2 Steady States Analysis 

The steady states are obtained as solutions of a nonlinear algebraic system,
obtained from (1)-(6) by setting

$$\frac{ds_1}{dt} = 0, \quad \frac{ds_2}{dt} = 0, \quad \frac{dx_1}{dt} = 0, \quad \frac{dx_2}{dt} = 0, \quad \frac{dc}{dt} = 0, \quad \frac{dz}{dt} = 0.$$

Excluding the trivial solutions $s_1 = s_1^i$, $s_2 = s_2^i$, $x_1 = x_2 = 0$ (which
are called wash-out steady states and are not of practical interest), it is
straightforward to see that for any $u \in U$, where



the steady states (chosen for biological reasons as one of the two possible
nontrivial solutions) are



Let us fix a point $\bar u \in U$. Denote

$$s_1^* = s_1(\bar u), \ s_2^* = s_2(\bar u), \ x_1^* = x_1(\bar u), \ x_2^* = x_2(\bar u), \ c^* = c(\bar u), \ z^* = z(\bar u) = z^i.$$

The point $P^* = (s_1^*, s_2^*, x_1^*, x_2^*, c^*, z^*)$ is called rest point of (1)-(6).

Let $\bar U \subset U$ be a compact neighborhood of $\bar u$. In the following we shall
assume that $\bar U$ is the set of the admissible values of the control. We shall
study the stabilizability of the system (1)-(6) in a neighborhood $\Omega$ of the
point $P^*$, where the “size” of $\Omega$ depends on the choice of $\bar U$.

3 Stability of the Linearization of the Model

In this section we consider the system consisting of the first five ODEs (1)-(5)
of the model with $\alpha = 1/2$. We make the following change:

$$\xi_1 = s_1 - s_1^*, \ \xi_2 = s_2 - s_2^*, \ \xi_3 = x_1 - x_1^*, \ \xi_4 = x_2 - x_2^*, \ \xi_5 = c - c^*, \ v = u - \bar u.$$

Denote $\xi = (\xi_1, \xi_2, \xi_3, \xi_4, \xi_5)$. Using the Taylor expansion around the
origin, the system (in the new coordinates) can be written in the form

$$\frac{d\xi}{dt} = C\xi + bv + R(\theta\xi) \quad (0 < \theta < 1), \qquad (7)$$

where $C$ is the Jacobian of the right-hand side functions with respect to $\xi$ at
the point $\xi = 0$, $v = 0$; $b$ is the vector of the derivatives of the right-hand
side functions with respect to $v$ at the point $\xi = 0$; $R(\theta\xi)$ is the
residual, which can be easily computed.

A standard approach in nonlinear control theory is to try to stabilize the
linear approximation $\frac{d\xi}{dt} = C\xi + bv$ of the system (7) at the origin.
Using the numerical values for the model parameters in Table 1, one can check
that the matrix $B = (b, Cb, C^2 b, C^3 b, C^4 b)$ is regular. Therefore, according
to the Kalman controllability criterion, the above linear system is controllable
at the origin. We decided to apply an approach based on the Ackermann formula
(cf. [1,2]). Let $\lambda_i$, $i = 1,\dots,5$, be arbitrary negative real numbers.
Denote

$$\tilde C = C - b k^T, \quad \text{where} \quad k^T = e^T Q(C), \quad e^T = (0,0,0,0,1)\, B^{-1}, \quad Q(\lambda) = (\lambda - \lambda_1) \cdots (\lambda - \lambda_5).$$

Then the linear approximation takes the form

$$\frac{d\xi}{dt} = \tilde C \xi. \qquad (8)$$

It can be directly checked that the eigenvalues of the matrix $\tilde C$ are
$\lambda_i$, $i = 1,\dots,5$. Since $\lambda_i$, $i = 1,\dots,5$, are negative, the
linear system (8) is asymptotically stable.
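The Ackermann construction used above can be sketched in a few lines. The following is a minimal illustration for a generic single-input pair $(C, b)$; it does not use the Jacobian of the digester model, whose entries are not listed here, and the small matrices in the usage lines are hypothetical test data. All helper names are illustrative.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_vec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-14:
            raise ValueError("singular matrix")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def poly_from_roots(roots):
    """Coefficients (highest degree first) of the monic polynomial with the given roots."""
    coeffs = [1.0]
    for root in roots:
        new = coeffs + [0.0]
        for i, ci in enumerate(coeffs):
            new[i + 1] -= root * ci
        coeffs = new
    return coeffs


def poly_of_matrix(coeffs, a):
    """Evaluate the polynomial at the matrix a by Horner's rule."""
    n = len(a)
    result = [[0.0] * n for _ in range(n)]
    for ci in coeffs:
        result = mat_mul(result, a)
        for i in range(n):
            result[i][i] += ci
    return result


def ackermann_gain(C, b, poles):
    """Return k with k^T = (0,...,0,1) B^{-1} Q(C), where B = (b, Cb, ..., C^{n-1} b)."""
    n = len(C)
    cols = [b[:]]
    for _ in range(n - 1):
        cols.append(mat_vec(C, cols[-1]))
    # The columns of B are the rows of B^T, so `cols` is B^T stored row by row.
    e = solve(cols, [0.0] * (n - 1) + [1.0])      # e^T = (0,...,0,1) B^{-1}
    QC = poly_of_matrix(poly_from_roots(poles), C)
    return [sum(e[i] * QC[i][j] for i in range(n)) for j in range(n)]


def closed_loop(C, b, k):
    n = len(C)
    return [[C[i][j] - b[i] * k[j] for j in range(n)] for i in range(n)]


C2 = [[0.0, 1.0], [2.0, -1.0]]
k2 = ackermann_gain(C2, [0.0, 1.0], [-1.0, -2.0])

C3 = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [2.0, -1.0, 3.0]]
b3 = [0.0, 0.0, 1.0]
poles3 = [-1.0, -2.0, -3.0]
k3 = ackermann_gain(C3, b3, poles3)
residual3 = poly_of_matrix(poly_from_roots(poles3), closed_loop(C3, b3, k3))
```

The check at the end relies on the Cayley-Hamilton theorem: if the closed-loop matrix has characteristic polynomial $Q$, then $Q$ evaluated at that matrix vanishes. For matrices with widely separated scales, as in the digester model, the Kalman matrix may be badly conditioned and such a direct computation should be treated with care.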

Unfortunately, the computer simulation showed very slow convergence of the
solution of (8) to the rest point $\xi^* = 0$. Therefore we decided to investigate
the Lyapunov function of (8). Using Claim 5.35 ([7], p. 211), a symmetric positive
definite matrix $P$ is found by solving the linear algebraic system
$\tilde C^T P + P \tilde C = -E$, where $E$ is the identity matrix. Then
$\eta(\xi) = \xi^T P \xi$ is the Lyapunov function of the linear system (8). Using
the numerical values in Table 1 (with $\alpha = 0.5$) and choosing $\lambda_i = -1$,
$i = 1,\dots,5$, the matrix $P$ and its eigenvalues
$\lambda^P = (\lambda_1^P, \lambda_2^P, \lambda_3^P, \lambda_4^P, \lambda_5^P)$ can be
explicitly computed; the latter are

A^ = (0.788 • lOi^ 0.472 • 10®, 1.87, 0.316, 0.159). 

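Since the difference $g(\lambda^P)$ between the maximal and minimal eigenvalue of $P$,

$$g(\lambda^P) = \max\{\lambda_1^P, \lambda_2^P, \lambda_3^P, \lambda_4^P, \lambda_5^P\} - \min\{\lambda_1^P, \lambda_2^P, \lambda_3^P, \lambda_4^P, \lambda_5^P\},$$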

is too large, g(X^) = 0.79 • 10^^, the computer simulations show very bad con- 
vergence of the solution of (8) to the rest point ^* = 0. 

As a next step we tried to minimize $g(\lambda^P)$ by solving a nonlinear
optimization problem. Unfortunately, all efforts to stabilize the linear
approximation (8) of the nonlinear system (7) were unsuccessful: in all cases the
convergence to the rest point was too slow to be of practical use.



4 Stabilization of the Nonlinear System 

We consider the nonlinear dynamic system (1)-(6) with $\alpha = 1$. This value of
$\alpha$ corresponds to the ideal continuous stirred tank reactor [3]. For
$\alpha \ne 1$, all considerations can be done in a similar way.

Let $P^* = (s_1^*, s_2^*, x_1^*, x_2^*, c^*, z^*)^T$ be the rest point corresponding
to the fixed $\bar u$ (see Section 2) from the interior int $\bar U$ of the set
$\bar U$ of admissible values for the control. It can be directly verified that the
following subset of $\mathbb{R}^6$

$$H = \{ (s_1, s_2, x_1, x_2, c, z)^T : k_1 x_1 + s_1 - s_1^i = 0, \ k_2 x_1 - k_3 x_2 - s_2 + s_2^i = 0 \}$$

is invariant with respect to the system (1)-(6) for every admissible control $u$
(i.e. for every integrable function with values from $\bar U$). This means that the
trajectory of (1)-(6) remains in $H$ for every choice of the admissible control
$u$ whenever the starting point belongs to $H$. Moreover, starting from a point
outside the set $H$, the trajectory of (1)-(6) tends to $H$ for every choice of the
control function $u$. This observation is the main motivation to define the
feedback

$$u = k(s_1, s_2, x_1, x_2, c, z) = \frac{k_1 \mu_{\max} s_1 x_1}{(k_{s_1} + s_1)(s_1^i - s_1)} - \delta (s_1 - s_1^*), \qquad (9)$$

where $\delta > 0$ is a parameter that will be chosen later in a suitable way.
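The invariance and attractivity of $H$ can be checked directly from equations (1)-(4) with $\alpha = 1$. In the sketch below, the names $q_1$ and $q_2$ for the two defining expressions of $H$ are introduced only for this computation.

```latex
\begin{aligned}
q_1 &:= k_1 x_1 + s_1 - s_1^i, \qquad q_2 := k_2 x_1 - k_3 x_2 - s_2 + s_2^i, \\[4pt]
\frac{dq_1}{dt} &= k_1(\mu_1 - u)x_1 + u(s_1^i - s_1) - k_1\mu_1 x_1
                 = -u\,(k_1 x_1 + s_1 - s_1^i) = -u\,q_1, \\[4pt]
\frac{dq_2}{dt} &= k_2(\mu_1 - u)x_1 - k_3(\mu_2 - u)x_2
                   - u(s_2^i - s_2) - k_2\mu_1 x_1 + k_3\mu_2 x_2 \\
                &= -u\,(k_2 x_1 - k_3 x_2 - s_2 + s_2^i) = -u\,q_2, \\[4pt]
q_j(t) &= q_j(0)\,\exp\!\Big(-\int_0^t u(\tau)\,d\tau\Big), \qquad j = 1, 2 .
\end{aligned}
```

Hence $q_j(0) = 0$ implies $q_j(t) = 0$ for all $t$, and $q_j(t) \to 0$ whenever the control stays bounded away from zero, whatever the particular control function is.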



Definition 1. Let $\Omega$ be a neighborhood of the rest point $P^*$. Any locally
Lipschitz continuous function $k : \Omega \to \bar U$ is called admissible feedback.
It is said that the admissible feedback $k$ stabilizes asymptotically the control
system (1)-(6) if:
i) for every point $P^0 \in \Omega$ the trajectory $l(\cdot, k, P^0)$ of (1)-(6)
starting from $P^0$ and corresponding to the control $u = k(\cdot)$ is defined for
every $t \in [0, \infty)$;
ii) $l(t, k, P^0) \in \Omega$ for every $t \ge 0$;
iii) $\lim_{t \to \infty} l(t, k, P^0) = P^*$.

Theorem 1. There exists a neighborhood $\Omega$ of $P^*$ and a number $\delta > 0$
such that the feedback (9) is admissible and stabilizes asymptotically the control
system (1)-(6) to the point $P^*$.

Proof. Since $P^* \in H$, the positive real number

$$\frac{k_1 \mu_{\max} s_1^* x_1^*}{(k_{s_1} + s_1^*)(s_1^i - s_1^*)}$$

belongs to int $\bar U$. Hence, there exist $\varepsilon_1 > 0$, $\varepsilon_2 > 0$,
$\varepsilon_3 > 0$, $\varepsilon_4 > 0$ and $\delta > 0$ such that the feedback (9) is
admissible on the neighborhood $\Omega$ of the point $P^*$, where

$$\Omega = \{ (s_1, s_2, x_1, x_2, c, z)^T : |k_1 x_1 + s_1 - s_1^i| < \varepsilon_1, \ |k_2 x_1 - k_3 x_2 - s_2 + s_2^i| < \varepsilon_2,$$
$$|s_1 - s_1^*| < \varepsilon_3, \ |x_1 - x_1^*| < \varepsilon_4, \ s_1 > 0, \ s_2 > 0, \ x_1 > 0, \ x_2 > 0, \ c > 0, \ z > 0 \}.$$

Let $P^0 = (s_1^0, s_2^0, x_1^0, x_2^0, c^0, z^0)^T \in \Omega$ be an arbitrary point and
let $l(\cdot, k, P^0) = (s_1(\cdot), s_2(\cdot), x_1(\cdot), x_2(\cdot), c(\cdot), z(\cdot))^T$
be the solution of (1)-(6) starting from $P^0$ and corresponding to the feedback
$k$. It can be directly checked that $l(\cdot, k, P^0)$ is well defined on
$[0, +\infty)$ and $l(t, k, P^0) \in \Omega$ for every $t \ge 0$ (according to
Theorem 2.4 ([4], p. 194)). Denote further

$$k(t) = k(s_1(t), s_2(t), x_1(t), x_2(t), c(t), z(t)),$$

$$p_1(t) = k_1 x_1(t) + s_1(t) - s_1^i, \qquad p_2(t) = -k_2 x_1(t) + k_3 x_2(t) + s_2(t) - s_2^i,$$

$$\mu_1^* = \mu_1(s_1^*), \quad \mu_2^* = \mu_2(s_2^*), \quad \bar\mu_1(t) = \mu_1(s_1(t)), \quad \bar\mu_2(t) = \mu_2(s_2(t)).$$

There exist constants $\gamma > 0$ and $\kappa > 0$ such that for every $t \ge 0$

$$k(t) \ge \kappa, \qquad \bar\mu_2(t) \le \gamma \qquad (10)$$

hold true. Moreover,

$$\frac{d}{dt} s_1(t) \ \begin{cases} < 0 & \text{if } s_1(t) > s_1^*, \\ > 0 & \text{if } s_1(t) < s_1^*, \end{cases}$$

$$\frac{d}{dt} p_1(t) = -k(t)\, p_1(t) \quad \text{and} \quad \frac{d}{dt} p_2(t) = -k(t)\, p_2(t).$$

The above relations and (10) imply $p_1(t) \to 0$ and $p_2(t) \to 0$ whenever
$t \to \infty$.

Since $\delta > 0$ and

$$\frac{d}{dt}(s_1(t) - s_1^*) = -\delta\, (s_1^i - s_1(t))(s_1(t) - s_1^*),$$

it follows that $\lim_{t \to \infty} s_1(t) = s_1^*$. Further the definition of
$p_1(\cdot)$ implies $\lim_{t \to \infty} x_1(t) = x_1^*$. Both relations
$x_1(t) \to x_1^*$ and $s_1(t) \to s_1^*$ lead to $\lim_{t \to \infty} k(t) = \bar u$.

Taking into account the equalities

$$\bar u (s_2^i - s_2^*) + k_2 \mu_1^* x_1^* - k_3 \mu_2^* x_2^* = 0, \qquad -s_2^* - k_3 x_2^* + k_2 x_1^* + s_2^i = 0,$$

we obtain further

$$\frac{d}{dt}(s_2(t) - s_2^*) = -\Big( k(t) - \bar\mu_2(t) + k_3 x_2^* \frac{\bar\mu_2(t) - \mu_2^*}{s_2(t) - s_2^*} \Big)(s_2(t) - s_2^*)$$
$$+ (k(t) - \bar u)(s_2^i - s_2^*) + k_2 (\bar\mu_1(t) x_1(t) - \mu_1^* x_1^*) - \bar\mu_2(t)(p_2(t) + k_2 x_1(t) - k_2 x_1^*).$$

Denote

$$m(s_2(t)) = -\bar\mu_2(t) + k_3 x_2^* \frac{\bar\mu_2(t) - \mu_2^*}{s_2(t) - s_2^*}.$$

Using the expressions for $\bar\mu_2(t)$ and $\mu_2^*$ one gets

$$m(s_2(t)) = \frac{\mu_0}{k_{s_2} + s_2(t) + (s_2(t)/k_I)^2} \left( k_3 x_2^* \, \frac{k_{s_2} - s_2^* s_2(t) / k_I^2}{k_{s_2} + s_2^* + (s_2^*/k_I)^2} - s_2(t) \right).$$

By means of the numerical values of the model parameters (see Table 1) it can
be verified that $k(t) + m(s_2(t)) > 0.05$ in the neighborhood $\Omega$ with
$\varepsilon_2 = 0.1$, $\varepsilon_3 = 0.01$, $\varepsilon_4 = 0.05$ (and $\delta = 10$).
Hence, there exists a constant $M$, such that for every $t \ge 0$,
$k(t) + m(s_2(t)) \ge M > 0$ for $s_2(t) \ne s_2^*$. Therefore,

$$\frac{d}{dt} |s_2(t) - s_2^*| \le -M |s_2(t) - s_2^*| + \nu(t),$$

where
$\nu(t) = |k(t) - \bar u|\,|s_2^i - s_2^*| + k_2 |\bar\mu_1(t) x_1(t) - \mu_1^* x_1^*| + \bar\mu_2(t)\,|p_2(t) + k_2 x_1(t) - k_2 x_1^*|$.
Taking into account that $k(t) \to \bar u$, $p_2(t) \to 0$, $x_1(t) \to x_1^*$,
$\bar\mu_1(t) \to \mu_1^*$ whenever $t \to \infty$, we obtain
$\lim_{t \to \infty} \nu(t) = 0$. The Cauchy formula for the solution of linear ODEs
and the Gronwall inequality imply


|s2(t) - 41 







max \v{t)\ + 

sS[0,t] 



1 

sS[t,+oo) 






which means that $\lim_{t \to \infty} s_2(t) = s_2^*$.

Further the relations

$$\frac{d}{dt} |z(t) - z^i| = -k(t)\, |z(t) - z^i| \le -\kappa\, |z(t) - z^i|$$

lead to $\lim_{t \to \infty} z(t) = z^*$. Finally, (5), (10) and the equality
$\bar u (c^i - c^*) + k_4 \mu_1^* x_1^* + k_5 \mu_2^* x_2^* = Q$ imply

$$\frac{d}{dt} |c(t) - c^*| \le -\kappa\, |c(t) - c^*| + k_4 |\bar\mu_1(t) x_1(t) - \mu_1^* x_1^*| + k_5 |\bar\mu_2(t) x_2(t) - \mu_2^* x_2^*| + |k(t) - \bar u|\,|c^i - c^*|.$$

Using again the Cauchy formula and the relations $k(t) \to \bar u$,
$\bar\mu_1(t) x_1(t) \to \mu_1^* x_1^*$ and $\bar\mu_2(t) x_2(t) \to \mu_2^* x_2^*$, we
obtain as above that $\lim_{t \to \infty} c(t) = c^*$.

This completes the proof.



5 Numerical Simulation 

All computations and graphical outputs are carried out in the computer algebra
system Maple 7. Using the numerical values from Table 1, the admissible interval
for the control is $U = (0, 0.5359]$. With $\bar u = 0.4 \in U$, the rest point $P^*$ is

$$s_1^* = 3.55, \ s_2^* = 11.528, \ x_1^* = 0.32764, \ x_2^* = 0.063168, \ c^* = 105.92, \ z^* = 67.$$
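The rest point can be recomputed from equations (1)-(6) and the values of Table 1. The sketch below makes two assumptions that are not stated explicitly in the text: the nontrivial steady state is obtained by solving $\mu_1(s_1) = \alpha u$ and $\mu_2(s_2) = \alpha u$, and the root of the quadratic equation for $s_2$ that is "chosen for biological reasons" is the smaller one. The bound returned by `max_dilution_rate` is the maximum of $\mu_2$ divided by $\alpha$, which is only conjectured here to be the right end of $U$. The function and dictionary names are hypothetical.

```python
import math

# Parameter values from Table 1.
PARAMS = {
    "s1_in": 7.0, "s2_in": 70.0, "c_in": 65.0, "z_in": 67.0, "Q": 20.0,
    "k1": 10.53, "k2": 28.6, "k3": 1074.0, "k4": 12.42, "k5": 1375.0,
    "mu_max": 1.2, "mu0": 0.74, "ks1": 7.1, "ks2": 9.28, "kI": 16.0,
}


def rest_point(u, alpha=1.0, p=PARAMS):
    """Nontrivial steady state (s1, s2, x1, x2, c, z) of (1)-(6) for a constant dilution rate u."""
    g = alpha * u                                   # required growth rate mu_1 = mu_2 = alpha*u
    s1 = g * p["ks1"] / (p["mu_max"] - g)           # from mu_1(s1) = g
    # mu_2(s2) = g  <=>  (g / kI^2) s2^2 + (g - mu0) s2 + g ks2 = 0
    a = g / p["kI"] ** 2
    bq = g - p["mu0"]
    cq = g * p["ks2"]
    disc = bq * bq - 4.0 * a * cq
    if disc < 0.0:
        raise ValueError("no nontrivial steady state for this dilution rate")
    s2 = (-bq - math.sqrt(disc)) / (2.0 * a)        # assumption: the smaller root is chosen
    x1 = (p["s1_in"] - s1) / (alpha * p["k1"])      # from (1)
    x2 = (p["s2_in"] - s2 + alpha * p["k2"] * x1) / (alpha * p["k3"])   # from (2)
    c = p["c_in"] + alpha * (p["k4"] * x1 + p["k5"] * x2) - p["Q"] / u  # from (5)
    z = p["z_in"]                                   # from (6)
    return s1, s2, x1, x2, c, z


def max_dilution_rate(alpha=1.0, p=PARAMS):
    """Largest u for which mu_2(s2) = alpha*u has a real solution."""
    return p["mu0"] / (alpha * (1.0 + 2.0 * math.sqrt(p["ks2"]) / p["kI"]))


s1_star, s2_star, x1_star, x2_star, c_star, z_star = rest_point(0.4)
u_max = max_dilution_rate()
```

With $u = 0.4$ and $\alpha = 1$ these formulas reproduce the rest point listed above to the printed accuracy, and the computed bound agrees with the right end of the interval $U$.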

Let the initial point at $t = 0$ be

$$s_1(0) = 3.544, \ s_2(0) = 13, \ x_1(0) = 0.363, \ x_2(0) = 0.1, \ c(0) = 103, \ z(0) = 75.$$

Using the feedback (9) with $\delta = 10$, the system (1)-(6) is solved numerically
on a uniform mesh $t_i = ih$, $i = 1, 2, \dots, 150$ with $h = 0.1$. The numerical
outputs are visualized in different phase planes (see Figures 1 and 2). In all
plots the circle symbol marks the projection of the starting point and the diamond
symbol marks the projection of the rest point in the corresponding coordinates.
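The closed-loop behaviour can be reproduced with a short simulation. The sketch below is not the Maple computation of the paper: it assumes $\alpha = 1$, uses a classical fourth-order Runge-Kutta integrator with a fixed step that is smaller than the output mesh (an explicit method needs a small step here because $s_1$ converges quickly when $\delta = 10$), and integrates over a longer horizon so that the convergence is clearly visible. All identifiers are illustrative.

```python
# Parameter values from Table 1 (alpha = 1).
S1_IN, S2_IN, C_IN, Z_IN, Q_GAS = 7.0, 70.0, 65.0, 67.0, 20.0
K1, K2, K3, K4, K5 = 10.53, 28.6, 1074.0, 12.42, 1375.0
MU_MAX, MU_0, KS1, KS2, KI = 1.2, 0.74, 7.1, 9.28, 16.0
S1_STAR = 3.55          # first coordinate of the rest point for u = 0.4
DELTA = 10.0


def mu1(s1):
    return MU_MAX * s1 / (KS1 + s1)


def mu2(s2):
    return MU_0 * s2 / (KS2 + s2 + (s2 / KI) ** 2)


def feedback(state):
    """Feedback law (9)."""
    s1, _, x1, _, _, _ = state
    return K1 * mu1(s1) * x1 / (S1_IN - s1) - DELTA * (s1 - S1_STAR)


def rhs(state):
    """Right-hand side of (1)-(6) with u given by the feedback (9)."""
    s1, s2, x1, x2, c, z = state
    u = feedback(state)
    m1, m2 = mu1(s1), mu2(s2)
    return [
        u * (S1_IN - s1) - K1 * m1 * x1,
        u * (S2_IN - s2) + K2 * m1 * x1 - K3 * m2 * x2,
        (m1 - u) * x1,
        (m2 - u) * x2,
        u * (C_IN - c) + K4 * m1 * x1 + K5 * m2 * x2 - Q_GAS,
        u * (Z_IN - z),
    ]


def rk4_step(f, y, dt):
    ka = f(y)
    kb = f([yi + 0.5 * dt * ki for yi, ki in zip(y, ka)])
    kc = f([yi + 0.5 * dt * ki for yi, ki in zip(y, kb)])
    kd = f([yi + dt * ki for yi, ki in zip(y, kc)])
    return [yi + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
            for yi, a, b, c, d in zip(y, ka, kb, kc, kd)]


def simulate(y0, t_end=30.0, dt=0.002):
    y = list(y0)
    for _ in range(int(round(t_end / dt))):
        y = rk4_step(rhs, y, dt)
    return y


initial = [3.544, 13.0, 0.363, 0.1, 103.0, 75.0]
u_initial = feedback(initial)
final = simulate(initial)
u_final = feedback(final)
```

Along the computed trajectory the control starts near 0.50, which is inside the admissible interval, and settles at the value 0.4 corresponding to the rest point.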

Figure 1 (left) visualizes the projection of the neighborhood $\Omega$ in the
$(s_1, x_1)$-plane,
$\{ (s_1, x_1) : |k_1 x_1 + s_1 - s_1^i| < 0.5, \ |s_1 - s_1^*| < 0.007 \}$; at all
points below the curve $g : k(s_1, x_1) = 0.5359$ the feedback law is admissible.

Fig. 1. Plots of the trajectories in $(s_1, x_1)$ (left) and $(s_2, x_2)$ (right) coordinates

Fig. 2. Plots of the trajectories in $(z, c)$ (left) and $(c, x_1)$ (right) coordinates



Abstract. A numerical approach for the approximation of trajectories
of a smooth affine control system is proposed under suitable assumptions.
This approach is based on the expansion of solutions of systems of ordinary
differential equations (ODE) by Volterra series and allows one to estimate
the distance between the obtained approximation and the true trajectory.



1 Introduction 

Applying traditional higher-order numerical schemes (such as Runge-Kutta
schemes) to approximate trajectories of nonlinear control systems is a nontrivial
task (cf. for example [4, 11] etc.) due to the nonsmoothness of the control
functions. For the case of linear differential inclusions, this problem is studied
in [9]. A numerical procedure is proposed there, based on a suitable approximation
of integrals of multivalued mappings (cf. for example [10]) and on the ideology
of algorithms with result verification (cf. for example [2]). Another approach for
approximating trajectories of affinely controlled systems is proposed in [5] and
[6]. This approach is based on the expansion of the solution of systems of
ordinary differential equations by Volterra series (cf. for example [7]). In this
note, combining this approach with the ideas developed in [9], we propose a
method for the approximation of trajectories of analytic control systems with
guaranteed accuracy. We would like to point out that our approach can be applied
to general control systems that are smooth with respect to the phase variables
and continuous with respect to the control.

2 Systems of ODE and Volterra Series 

First, we briefly introduce some notations and notions. For every point
$y = (y^1, \dots, y^n)^T$ from $\mathbb{R}^n$ we set $\|y\| := \sum_{i=1}^{n} |y^i|$, and
let $B$ be the unit ball in $\mathbb{R}^n$ (according to this norm) centered at the
origin. Let $\mathbb{C}$ denote the set of all complex numbers, $x_0 \in \mathbb{R}^n$
and $\Omega$ be a convex compact neighbourhood of the point $x_0$. If
$z \in \mathbb{C}$, then by $|z|$, Re $z$ and Im $z$ we denote its norm, its real and
imaginary parts, respectively. Let $\sigma > 0$. We set

$$\Omega^\sigma := \{ z = (z_1, \dots, z_n) \in \mathbb{C}^n : (\mathrm{Re}\, z_1, \dots, \mathrm{Re}\, z_n) \in \Omega + \sigma B, \ |\mathrm{Im}\, z_i| < \sigma, \ i = 1, \dots, n \}.$$

By $\mathcal{F}_\Omega$ we denote the set of all real analytic functions defined on
$\Omega$, such that every $\phi \in \mathcal{F}_\Omega$ has a bounded analytic
extension $\bar\phi$ on $\Omega^\sigma$. We define a norm in the set
$\mathcal{F}_\Omega$ as follows:

$$\|\phi\|_\Omega = \sup \{ |\bar\phi(z)| : z \in \Omega^\sigma \}.$$

Let $h(x) = (h_1(x), h_2(x), \dots, h_n(x))^T$, $x \in \Omega$, be a vector field
defined on $\Omega$. As usual, we identify $h$ with the corresponding differential
operator

$$\sum_{i=1}^{n} h_i(x) \frac{\partial}{\partial x_i}, \quad x \in \Omega.$$

Let $V_\Omega$ be the set of all real analytic vector fields $h$ defined on $\Omega$
such that every $h_i$, $i = 1, \dots, n$, belongs to $\mathcal{F}_\Omega$. We define
the following norm in $V_\Omega$:

$$\|h\|_\Omega = \max ( \|h_i\|_\Omega, \ i = 1, \dots, n ).$$

An integrable analytic vector field
$X_t(x) = (X_1(t,x), X_2(t,x), \dots, X_n(t,x))$, $x \in \Omega$, $t \in \mathbb{R}$
(parameterized by $t$), is a map $t \mapsto X_t \in V_\Omega$ such that:

i) for every $x \in \Omega$, the functions $X_1(\cdot,x), X_2(\cdot,x), \dots, X_n(\cdot,x)$ are measurable;

ii) for every $t \in \mathbb{R}$ and $x \in \Omega$, $|X_i(t,x)| \le m(t)$,
$i = 1, 2, \dots, n$, where $m$ is an integrable (on every compact interval)
function.

Further we shall consider only uniformly integrable vector fields $X_t$, i.e.




0 whenever 



1*2 



0 . 



Let $M$ be a compact set contained in the interior of $\Omega$ and containing the
point $x_0$, and let $X_t$ be an integrable analytic vector field defined on
$\Omega$. Then there exists a real number $T(M, X_t) > t_0$ such that for every
point $x$ of $M$ the solution $y(\cdot, x)$ of the differential equation

$$\dot y(t,x) = X_t(y(t,x)), \quad y(t_0, x) = x, \qquad (1)$$

is defined on the interval $[t_0, T(M, X_t)]$ and $y(T, x) \in \Omega$ for every $T$
from $[t_0, T(M, X_t)]$. In this case we denote by
$\exp \int_{t_0}^{T} X_t\, dt : M \to \Omega$ the diffeomorphism defined by

$$\exp \int_{t_0}^{T} X_t\, dt\, (x) := y(T, x).$$



According to Proposition 2.1 from [1], $T(M, X_t) > t_0$ can be chosen in such a
way that for every $T$, $t_0 < T < T(M, X_t)$, for every point $x$ from $M$ and for
every function $\phi$ from $\mathcal{F}_\Omega$, the following expansion of
$\phi \big( \exp \int_{t_0}^{T} X_t\, dt\, (x) \big)$ holds true:

$$\phi \Big( \exp \int_{t_0}^{T} X_t\, dt\, (x) \Big) = \exp \int_{t_0}^{T} X_t\, dt\ \phi(x), \qquad (2)$$

where

$$\exp \int_{t_0}^{T} X_t\, dt\ \phi(x) = \phi(x) + \sum_{N=1}^{\infty} \int_{t_0}^{T} \int_{t_0}^{\tau_1} \int_{t_0}^{\tau_2} \cdots \int_{t_0}^{\tau_{N-1}} X_{\tau_N} \cdots X_{\tau_2} X_{\tau_1} \phi(x)\, d\tau_N\, d\tau_{N-1} \cdots d\tau_1,$$

and the series is absolutely convergent. The proof of these identities is based on
the following estimate: for every positive real number $\sigma_1 < \sigma$, for
every point $x$ of $M$, for every function $\phi$ from $\mathcal{F}_\Omega$ and for
every points $\tau_1, \tau_2, \dots, \tau_N$ from $[t_0, T]$, the following
inequality holds true

^ ( 2 ) 

This estimate implies the following technical lemma: 

Lemma 1. Let M he a eonvex eomnaet subset of f2, tb S Tn, X* € Vn and 
to < Ti < T 2 < • • • < < T. ITe set 

^Ti,T2,...,Tf_, •= 

Then for every 0 < a\ < a and for every two points y\, j /2 of XI the following 
inequality holds true: 



(2/2) ^ri,T2,...,r^(yi)| < 

/ o \ At+1 

^ + 1 )' 

Let the functions $E_i : \mathbb{R}^n \to \mathbb{R}$, $i = 1, \ldots, n$, be defined as follows: $E_i(z_1, \ldots, z_n) = z_i$. We set $E := (E_1, \ldots, E_n)^T$, $X_t E := (X_t E_1, \ldots, X_t E_n)^T$. Applying Lemma 1 we obtain that for every two points $y_1$ and $y_2$ from $M$ the following estimate holds true

II . . . X,,X,,E(y^) -X,^... Xr,Xr,E(yi)\\ < 

^ ■ ■ ■ \\Xr2VM ' Il^rj|^||y2 ~ 2/l||. (4) 

We use the estimate (4) to prove an existence criterion for $\exp \int_{t_0}^{t} X_s\,ds\,(x)$, where $x$ belongs to some compact set $M$:

Proposition 1. Let $M$ and $M_1$ be convex compact subsets of $\Omega$, $\sigma > \sigma_1 > 0$, $T > t_0$ and $\omega$ be a positive integer such that for every point $x \in M$ and for every $t \in [t_0, T]$ the following relations hold true:

0<n(^^J^^\\X^rM^ds] <1 (5) 
^ } rt pTi pT2 pTN-1

X ~h ^ ^ / / / j ^TN — l ■ ■ ■ ^T2^T-i ^ (^X^d,T]\J dT]\J—1 . . . (1T\ 

jy— *^^0 

/ c\ \ ^—1 / \‘^ 

-S G Ml. (6) 

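Then $\exp \int_{t_0}^{T} X_t\,dt\,(x)$ is well defined for every point $x \in M$. Moreover, $\exp \int_{t_0}^{\tau} X_t\,dt\,(x) \in M_1$ for every point $x \in M$ and for every $\tau$ from $[t_0, T]$.

Proof. Let $x$ be an arbitrary point from $M$. By $C(M_1; [t_0,T])$ we denote the set of all continuous functions defined on $[t_0,T]$ with values from the set $M_1$. We can define the operator $F : C(M_1; [t_0,T]) \to C(M_1; [t_0,T])$ as follows:

$$F(y)(t) = x + \sum_{N=1}^{\omega-1} \int_{t_0}^{t}\int_{t_0}^{\tau_1}\cdots\int_{t_0}^{\tau_{N-1}} X_{\tau_N}X_{\tau_{N-1}}\cdots X_{\tau_1}E(x)\,d\tau_N\,d\tau_{N-1}\cdots d\tau_1 + \int_{t_0}^{t}\int_{t_0}^{\tau_1}\cdots\int_{t_0}^{\tau_{\omega-1}} X_{\tau_\omega}X_{\tau_{\omega-1}}\cdots X_{\tau_2}X_{\tau_1}E(y(\tau_\omega))\,d\tau_\omega\,d\tau_{\omega-1}\cdots d\tau_1.$$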



Let $\|\cdot\|_C$ denote the usual uniform norm in $C(M_1; [t_0,T])$, i.e.

$$\|y\|_C = \max\{\|y(t)\| : t \in [t_0, T]\},$$

where $y(t) = (y_1(t), \ldots, y_n(t))^T$, $t \in [t_0,T]$. This operator is well defined on $C(M_1; [t_0,T])$. Let $y_1$ and $y_2$ be arbitrary elements of $C(M_1; [t_0,T])$. Applying the estimate (4), we obtain that



\\F{y 2 ) - F{yi)\\c = ma.x {\\F{y 2 ) (t) - F(j/i)(t)|| : r G [to,T]} < 



pT pTi pT 2 



<maxW / / ... ...X,,X,,.E( 2 / 2 (r^)) - 

KJto Jto Jto Jto 

Xr^Xr^__^...Xr2Xr^E{yi{Tuj))\\dT„jdT,„_l...dTi: tG [toX]} 
2n M r-ri 



< nuj\ 



a — ai 



’to J to ^ to 



'to 



IX 



T^llMi •• • IIXtsIImi ■ II-^ti||mi • dr,,dT,^-i . . . dri\\y 2 - yiWc = 

2n 



= n 



<7 — Cl 



T 

\\Xs\\M,ds\ ||y2-2/i||c = L||y2-2/i||c, 



where 



L := n 



2n 



^ Jto 



I|Xs||mi ds 



Since $0 < L < 1$, the operator $F$ is contractive. Applying the Banach fixed point theorem we obtain that there exists a unique function $y$ from $C(M_1; [t_0,T])$ such that $F(y) = y$. The last relation means that $y(t) = \exp \int_{t_0}^{t} X_s\,ds\,(x)$. This completes the proof.

The next proposition helps us to estimate the global approximation error using the local approximation error and motivates our approximation procedure.

Proposition 2. Let e > 0, tr > cti > 0, exp Xt dt (xq) is well defined and 
exp Jj* Xt dt (xq) € f2 for every t £ [to:T]. Let to < ti < ■ ■ ■ < tk = T , 



2n 



(7 — (Ti 




1 Ti 

< - and - 

Z LU 



2n 

a — ai 




K 

3A:“+i ’ 



Then there exist compact subsets $M_i$ of $\Omega$ such that $\exp \int_{t_0}^{t_i} X_t\,dt\,(x_0)$ belongs to $M_i$, $i = 0, 1, \ldots, k$, and



K ( 4n^ /■** \ 

diam Mi < — exp J \\Xs\\adsj 

where $\operatorname{diam} M := \max\{\|y_2 - y_1\| : y_j = (y_j^1, \ldots, y_j^n) \in M,\ j = 1, 2\}$.

Proof. We set $M_0 := \{x_0\}$ and define the sets $M_i$, $i = 1, 2, \ldots, k-1$, as follows: if $M_i$ is already defined, we set $M_{i+1} := \{F(x) : x \in M_i\} \cap \Omega$, where $F(x) :=$



CJ — 1 

E 

N =1 • 



— t ■ ■ ■ .^T2 (x)dT7V dT^V — 1 . . . dx\ 



+ 



n 



ui 






■ B. 



We set $y_0 = x_0$, $y_{i+1} = \exp \int_{t_i}^{t_{i+1}} X_t\,dt\,(y_i)$, $i = 0, 1, \ldots, k-1$. It can be directly verified that $y_i \in M_i$ and

$$y_k = \exp \int_{t_0}^{T} X_t\,dt\,(x_0) \in M_k.$$



Let us denote diam Mi by di. Clearly, do = diam Mq = 0. Let us assume that 
di+i < \\F{y 2 ) - F{yi)\\ + K/3k“, where yi, y 2 £ M*. Then 



di+i <^\y2~y\ 

2=1 



K 



^ r'^i + l c^l CT2 pTN-l 

S / / / ■/ 

A r 1 ^ tj ti ti ti 



+ 



— 1 ■ ■ ■ ^T2^ti ^ (y2 C?T/V — 1 ■ ■ ■ dTf 



iV=l 



fti + i pTi pT 2 pTN -1 

/ / / . . . / Xr^Xrp^_^ . . . Xr^Xr^E{yi)dTNdTN-l . . . C^Ti 

Jti Jti Jti Jti 



< Ily2-yi| 



K 

ka-\-l 



Jti 
LJ — l 

N^l 



< 



2n 



a — (Ji 



N 



Ml' 



N 



\\y 2 -yi\\< 
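$$\le \|y_2 - y_1\| \left(1 + \frac{4n^2}{\sigma - \sigma_1} \int_{t_i}^{t_{i+1}} \|X_s\|_\sigma^\Omega\,ds\right) + \frac{K}{k^{\alpha+1}}.$$

Hence, we have proved that

$$d_{i+1} \le d_i(1 + \delta_i) + \Delta, \quad i = 0, 1, \ldots, k-1,$$

where $\Delta = \frac{K}{k^{\alpha+1}}$ and

$$\delta_i = \frac{4n^2}{\sigma - \sigma_1} \int_{t_i}^{t_{i+1}} \|X_s\|_\sigma^\Omega\,ds.$$
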
Applying theorem 1.7 (p. 182, [3]), we obtain that for every 



di < exp 



i = l,2,...,fc. 



Substituting $\delta_i$ and $\Delta$ we complete the proof.
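
The passage from the recursive estimate to the final bound can be sketched as follows. This is only an illustration: it assumes that the recursion reads $d_{i+1} \le d_i(1+\delta_i) + \Delta$ with $d_0 = 0$, $\delta_i \ge 0$ and $\Delta = K/k^{\alpha+1}$, and it is not meant to replace the cited theorem 1.7 of [3].

$$d_i \le \Delta \sum_{j=0}^{i-1} \prod_{l=j+1}^{i-1} (1+\delta_l) \le i\,\Delta \exp\Big(\sum_{l=0}^{i-1}\delta_l\Big) \le \frac{K}{k^{\alpha}} \exp\Big(\sum_{l=0}^{i-1}\delta_l\Big), \quad i = 1, 2, \ldots, k.$$

Here the first inequality follows by induction on $i$, the second one uses $1 + \delta_l \le e^{\delta_l}$, and the third one uses $i \le k$.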



3 A Computational Procedure 

Let us consider the following control system: 

$$\frac{d}{dt}x(t) = f_0(x(t)) + \sum_{i=1}^{m} u_i f_i(x(t)), \quad x(0) = x_0, \tag{7}$$

where the state variable $x$ belongs to $\mathbb{R}^n$, $f_0, f_1, \ldots, f_m$ are real analytic vector functions and the admissible controls $u = (u_1, u_2, \ldots, u_m)$ are the Lebesgue integrable functions.

Let $\varepsilon > 0$, $\hat{T} > 0$ and $u : [t_0, \hat{T}] \to U$ be an admissible control. First, using Proposition 1 we can find a positive real $T$ (not greater than $\hat{T}$) and a compact subset $\Omega$ of $\mathbb{R}^n$ containing the point $x_0$ such that the corresponding trajectory $x : [t_0, T] \to \mathbb{R}^n$ is well defined on $[t_0,T]$ and $x(t) \in \Omega$ for every $t$ from $[t_0,T]$. Next, we show how we can calculate a point $y \in \Omega$ such that $|x(T) - y| < \varepsilon$. We assume that $f_i \in V_\sigma^\Omega$ for some $\sigma > 0$, $|u_j| \le v_j$, $j = 1, 2, \ldots, m$, and $\|f_i\|_\sigma^\Omega \le C$ for $i = 0, 1, \ldots, m$. We set $v_0 := 1$ and



m 

i=l 

Let $0 < \sigma_1 < \sigma$. For every positive integer $k$ we set $h := (T - t_0)/k$ and define the points $0 = t_0 < t_1 < \cdots < t_k$, where $t_i = t_0 + ih$. We choose $k$ to be sufficiently large, so that the following relation holds true:

$$\frac{4nhC}{\sigma - \sigma_1} < 1.$$
Next we choose a value for $\alpha > 1$ such that the following inequality holds true:



> - exp 
£ 



/ 4n^(T-fo)C \ 

V (ct-cti) / 



At the end we determine the accuracy of the desired local approximation by choosing the positive integer $\omega$ to be so large that



/2n{T-to)C 

V 



UJ 

< 1 and 



nhC 

UJ 



( 2nhC 
\cr - (Ti 



UJ — 1 

< 1 . 



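Since the trajectory $x : [t_0, T] \to \mathbb{R}^n$ is well defined on $[0,T]$ and belongs to $\Omega$, we obtain according to Proposition 2 that $|x(T) - z| < \varepsilon$, where $z := z_k$ and for every $i = 0, 1, \ldots, k-1$,

$$z_{i+1} := z_i + \sum_{N=1}^{\omega-1} \int_{t_i}^{t_{i+1}}\int_{t_i}^{\tau_1}\cdots\int_{t_i}^{\tau_{N-1}} X_{\tau_N}X_{\tau_{N-1}}\cdots X_{\tau_2}X_{\tau_1}E(z_i)\,d\tau_N\,d\tau_{N-1}\cdots d\tau_1.$$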



The following control systems are studied in [8]:

$$\dot{x}_1 = u, \quad |u| \le 1,$$
$$\dot{x}_2 = 3x_1^2, \quad x(0) = (0,0).$$



All time-optimal controls are piecewise constant with at most two pieces. The 
constant controls u = 1 and u = -1 generate the lower boundaries of the
reachable sets s — > (s, |s^|), and s — *■ (s, |s^| — s^), 0 < s < T, respectively. The 
controls with Us{t) = ±1 for 0 < f < s and Us{t) = for s < t < T steer to the 
curves of endpoints s — > (2s — T, 2s^ + {T — 2s)^) and s — > (2s — T, 2s^ + {T — 
2s)^ — (T — 2s)^), respectively. 
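
The role of the third order terms in this example can be checked with a short sketch. It is only an illustration, assuming the system $\dot{x}_1 = u$, $\dot{x}_2 = 3x_1^2$ and a control that is constant on each step, so that the $N$-th iterated integral reduces to $h^N X^N E / N!$. The function names are hypothetical and are not taken from the paper.

```python
def volterra_step(x, u, h, order=3):
    """One step of the truncated series for x1' = u, x2' = 3*x1**2, u constant on the step."""
    x1, x2 = x
    # Successive applications of X = u d/dx1 + 3*x1^2 d/dx2 to E = (x1, x2).
    derivs = [
        (u, 3.0 * x1 ** 2),   # X E
        (0.0, 6.0 * x1 * u),  # X^2 E
        (0.0, 6.0 * u ** 2),  # X^3 E; X^4 E and higher vanish identically
    ]
    y1, y2 = x1, x2
    factorial = 1.0
    for N, (d1, d2) in enumerate(derivs[:order], start=1):
        factorial *= N
        y1 += h ** N / factorial * d1
        y2 += h ** N / factorial * d2
    return (y1, y2)


def bang_bang_endpoint(s, T, order=3):
    """Endpoint for u = +1 on [0, s] followed by u = -1 on [s, T]."""
    x = volterra_step((0.0, 0.0), 1.0, s, order)
    return volterra_step(x, -1.0, T - s, order)
```

With `order=3`, one step per constant-control piece should reproduce the closed-form endpoint $(2s - T,\ 2s^3 + (T - 2s)^3)$ up to rounding, since all higher order terms vanish for this system. With `order=2` the step is no longer exact and a smaller timestep would be needed.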

Figure 1 shows the approximate reachable set for the system at the moment
t = 1 using first and second order terms in Volterra series. The timestep is
h = 

30 




The exact reachable set is presented on Fig. 2, using also third order terms 
in Volterra series. It is enough to take timestep equal to 1 in this case. 
Fig. 2. The exact reachable set
Abstract. In many application areas, it is important to detect outliers. The traditional engineering approach to outlier detection is that we start with some “normal” values $x_1, \ldots, x_n$, compute the sample average $E$, the sample standard deviation $\sigma$, and then mark a value $x$ as an outlier if $x$ is outside the $k_0$-sigma interval $[E - k_0 \cdot \sigma, E + k_0 \cdot \sigma]$ (for some pre-selected parameter $k_0$). In real life, we often have only interval ranges $[\underline{x}_i, \overline{x}_i]$ for the normal values $x_1, \ldots, x_n$. In this case, we only have intervals of possible values for the bounds $E - k_0 \cdot \sigma$ and $E + k_0 \cdot \sigma$. We can therefore identify outliers as values that are outside all $k_0$-sigma intervals. In this paper, we analyze the computational complexity of these outlier detection problems, and provide efficient algorithms that solve some of these problems (under reasonable conditions).



1 Introduction 

In many application areas, it is important to detect outliers, i.e., unusual, ab- 
normal values. In medicine, unusual values may indicate disease (see, e.g., [7]); 
in geophysics, abnormal values may indicate a mineral deposit or an erroneous 
measurement result (see, e.g., [5,9,13,16]); in structural integrity testing, ab- 
normal values may indicate faults in a structure (see, e.g., [2,6,7,10,11,17]), 
etc. 

The traditional engineering approach to outlier detection (see, e.g., [1, 12, 15]) is as follows:

— first, we collect measurement results $x_1, \ldots, x_n$ corresponding to normal situations;

— then, we compute the sample average $E \stackrel{\text{def}}{=} \frac{x_1 + \ldots + x_n}{n}$ of these normal values and the (sample) standard deviation $\sigma = \sqrt{V}$, where $V \stackrel{\text{def}}{=} \frac{(x_1 - E)^2 + \ldots + (x_n - E)^2}{n}$;

I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 238-245, 2004. 

(c) Springer- Verlag Berlin Heidelberg 2004 
— finally, a new measurement result $x$ is classified as an outlier if it is outside the interval $[L, U]$ (i.e., if either $x < L$ or $x > U$), where $L \stackrel{\text{def}}{=} E - k_0 \cdot \sigma$, $U \stackrel{\text{def}}{=} E + k_0 \cdot \sigma$, and $k_0 > 1$ is some pre-selected value (most frequently, $k_0 = 2$, 3, or 6).

In some practical situations, we only have intervals $\mathbf{x}_i = [\underline{x}_i, \overline{x}_i]$ of possible values of $x_i$. This happens, for example, if instead of observing the actual value $x_i$ of the random variable, we observe the value $\tilde{x}_i$ measured by an instrument with a known upper bound $\Delta_i$ on the measurement error; then, the actual (unknown) value is within the interval $\mathbf{x}_i = [\tilde{x}_i - \Delta_i, \tilde{x}_i + \Delta_i]$. For different values $x_i \in \mathbf{x}_i$, we get different bounds $L$ and $U$. Possible values of $L$ form an interval, which we will denote by $\mathbf{L} = [\underline{L}, \overline{L}]$; possible values of $U$ form an interval $\mathbf{U} = [\underline{U}, \overline{U}]$.

How do we now detect outliers? There are two possible approaches to this question: we can detect possible outliers and we can detect guaranteed outliers:

— a value $x$ is a possible outlier if it is located outside one of the possible $k_0$-sigma intervals $[L, U]$ (but it may be inside some other possible interval $[L, U]$);

— a value $x$ is a guaranteed outlier if it is located outside all possible $k_0$-sigma intervals $[L, U]$.

Which approach is more reasonable depends on the situation:

— if our main objective is not to miss an outlier, e.g., in structural integrity 
tests, when we do not want to risk launching a spaceship with a faulty part, 
it is reasonable to look for possible outliers; 

— if we want to make sure that the value x is an outlier, e.g., if we are planning a 
surgery and we want to make sure that there is a micro-calcification before we 
start cutting the patient, then we would rather look for guaranteed outliers. 

The two approaches can be described in terms of the endpoints of the intervals $\mathbf{L}$ and $\mathbf{U}$:

A value $x$ is guaranteed to be normal, i.e., it is not a possible outlier, if $x$ belongs to the intersection of all possible intervals $[L, U]$; the intersection corresponds to the case when $L$ is the largest and $U$ is the smallest, i.e., this intersection is the interval $[\overline{L}, \underline{U}]$. So, if $x > \underline{U}$ or $x < \overline{L}$, then $x$ is a possible outlier, else it is guaranteed to be a normal value.

If a value $x$ is inside one of the possible intervals $[L, U]$, then it can still be normal; the only case when we are sure that the value $x$ is an outlier is when $x$ is outside all possible intervals $[L, U]$, i.e., if the value $x$ does not belong to the union of all possible intervals $[L, U]$ of normal values; this union is equal to the interval $[\underline{L}, \overline{U}]$. So, if $x > \overline{U}$ or $x < \underline{L}$, then $x$ is a guaranteed outlier, else it can be a normal value.

In real life, the situation may be slightly more complicated because, as we have mentioned, measurements often come with interval inaccuracy; so, instead of the exact value $x$ of the measured quantity, we get an interval $\mathbf{x} = [\underline{x}, \overline{x}]$ of possible values of this quantity.

In this case, we have a slightly more complex criterion for outlier detection:

— the actual (unknown) value of the measured quantity is a possible outlier if some value $x$ from the interval $[\underline{x}, \overline{x}]$ is a possible outlier, i.e., is outside the intersection $[\overline{L}, \underline{U}]$; thus, the value is a possible outlier if one of the two inequalities holds: $\underline{x} < \overline{L}$ or $\underline{U} < \overline{x}$.

— the actual (unknown) value of the measured quantity is guaranteed to be an outlier if all possible values $x$ from the interval $[\underline{x}, \overline{x}]$ are guaranteed to be outliers (i.e., are outside the union $[\underline{L}, \overline{U}]$); thus, the value is a guaranteed outlier if one of the two inequalities holds: $\overline{x} < \underline{L}$ or $\overline{U} < \underline{x}$.
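
The two criteria above can be written as a small decision routine. The sketch below is only an illustration; the function name and the argument names (`L_lo`, `L_hi` for the endpoints of the range of $L$, `U_lo`, `U_hi` for the endpoints of the range of $U$) are hypothetical.

```python
def classify(x_lo, x_hi, L_lo, L_hi, U_lo, U_hi):
    """Classify a measured interval [x_lo, x_hi] given the ranges of L and U."""
    # Guaranteed outlier: the whole interval lies outside the union [L_lo, U_hi].
    if x_hi < L_lo or U_hi < x_lo:
        return "guaranteed outlier"
    # Possible outlier: some value lies outside the intersection [L_hi, U_lo].
    if x_lo < L_hi or U_lo < x_hi:
        return "possible outlier"
    return "guaranteed normal"
```

A point measurement is the special case `x_lo == x_hi`. Every guaranteed outlier is also a possible outlier, so the routine reports the stronger label first.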

Thus: 

— to detect possible outliers, we must be able to compute the values $\overline{L}$ and $\underline{U}$;

— to detect guaranteed outliers, we must be able to compute the values $\underline{L}$ and $\overline{U}$.

In this paper, we consider the problem of computing these bounds. 

2 What Was Known before 

As we discussed in the introduction, to detect outliers under interval uncertainty, we must be able to compute the range $\mathbf{L} = [\underline{L}, \overline{L}]$ of possible values of $L = E - k_0 \cdot \sigma$ and the range $\mathbf{U} = [\underline{U}, \overline{U}]$ of possible values of $U = E + k_0 \cdot \sigma$.

In [3,4], we have shown how to compute the intervals $\mathbf{E} = [\underline{E}, \overline{E}]$ and $[\underline{\sigma}, \overline{\sigma}]$ of possible values for $E$ and $\sigma$. In principle, we can use the general ideas of interval computations to combine these intervals and conclude, e.g., that $L$ always belongs to the interval $\mathbf{E} - k_0 \cdot [\underline{\sigma}, \overline{\sigma}]$. However, as often happens in interval computations, the resulting interval for $L$ is wider than the actual range, because the values $E$ and $\sigma$ are computed based on the same inputs $x_1, \ldots, x_n$ and cannot, therefore, change independently.
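
The overestimation can be seen on a tiny example. The following sketch is an illustration only: it takes two hypothetical inputs $x_1, x_2 \in [0, 1]$, samples them on a grid, and compares the naive combination of the ranges of $E$ and $\sigma$ with the sampled range of $L = E - k_0 \cdot \sigma$. The sampled range is an inner estimate, and all names are assumptions made for this example.

```python
import itertools
import math


def stats(xs):
    """Sample average and (population) standard deviation, as defined in the text."""
    n = len(xs)
    E = sum(xs) / n
    sigma = math.sqrt(sum((x - E) ** 2 for x in xs) / n)
    return E, sigma


def compare_ranges(k0=2.0, steps=11):
    grid = [i / (steps - 1) for i in range(steps)]  # sample points in [0, 1]
    Es, sigmas, Ls = [], [], []
    for xs in itertools.product(grid, repeat=2):
        E, sigma = stats(xs)
        Es.append(E)
        sigmas.append(sigma)
        Ls.append(E - k0 * sigma)
    # Naive interval arithmetic treats E and sigma as independent quantities.
    naive = (min(Es) - k0 * max(sigmas), max(Es) - k0 * min(sigmas))
    sampled = (min(Ls), max(Ls))
    return naive, sampled
```

For $k_0 = 2$ the naive enclosure is $[-1, 1]$, while the values of $L$ on the grid stay within $[-0.5, 1]$, because a small $E$ and a large $\sigma$ cannot occur for the same inputs.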

We mark a value x as an outlier if it is outside the interval [L,U]. Thus, if, 
instead of the actual ranges for L and U, we use wider intervals, we may miss 
some outliers. It is therefore important to compute the exact ranges for L and 
U . In this paper, we show how to compute these exact ranges. 

3 Detecting Possible Outliers 

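To find possible outliers, we must know the values $\underline{U}$ and $\overline{L}$. In this section, we design feasible algorithms for computing the exact lower bound $\underline{U}$ of the function $U$ and the exact upper bound $\overline{L}$ of the function $L$. Specifically, our algorithms are quadratic-time, i.e., require $O(n^2)$ computational steps (arithmetic operations or comparisons) for $n$ interval data points $\mathbf{x}_i = [\underline{x}_i, \overline{x}_i]$.

The algorithms $A_{\underline{U}}$ for computing $\underline{U}$ and $A_{\overline{L}}$ for computing $\overline{L}$ are as follows:

— In both algorithms, first, we sort all $2n$ values $\underline{x}_i$, $\overline{x}_i$ into a sequence $x_{(1)} \le x_{(2)} \le \ldots \le x_{(2n)}$; take $x_{(0)} = -\infty$ and $x_{(2n+1)} = +\infty$. Thus, the real line is divided into $2n + 1$ zones $(x_{(0)}, x_{(1)}]$, $[x_{(1)}, x_{(2)}]$, ..., $[x_{(2n-1)}, x_{(2n)}]$, $[x_{(2n)}, x_{(2n+1)})$.

— For each of these zones $[x_{(k)}, x_{(k+1)}]$, $k = 0, 1, \ldots, 2n$, we compute the values

$$e_k \stackrel{\text{def}}{=} \sum_{j:\, \underline{x}_j \ge x_{(k+1)}} \underline{x}_j + \sum_{i:\, \overline{x}_i \le x_{(k)}} \overline{x}_i, \tag{1}$$

$$m_k \stackrel{\text{def}}{=} \sum_{j:\, \underline{x}_j \ge x_{(k+1)}} (\underline{x}_j)^2 + \sum_{i:\, \overline{x}_i \le x_{(k)}} (\overline{x}_i)^2, \tag{2}$$

and $n_k$ = the total number of such $i$'s and $j$'s. Then, we solve the quadratic equation

$$A - B \cdot \mu + C \cdot \mu^2 = 0, \tag{3}$$

where

$$A \stackrel{\text{def}}{=} e_k^2 \cdot (1 + \alpha^2) - \alpha^2 \cdot m_k \cdot n; \quad \alpha \stackrel{\text{def}}{=} 1/k_0, \tag{4}$$

$$B \stackrel{\text{def}}{=} 2 \cdot e_k \cdot \left((1 + \alpha^2) \cdot n_k - \alpha^2 \cdot n\right); \quad C \stackrel{\text{def}}{=} n_k \cdot \left((1 + \alpha^2) \cdot n_k - \alpha^2 \cdot n\right). \tag{5}$$

For computing $\underline{U}$, we select only those solutions for which $\mu \cdot n_k \le e_k$ and $\mu \in [x_{(k)}, x_{(k+1)}]$; for computing $\overline{L}$, we select only those solutions for which $\mu \cdot n_k \ge e_k$ and $\mu \in [x_{(k)}, x_{(k+1)}]$. For each selected solution, we compute the values of

$$E_k = \frac{e_k}{n} + \frac{n - n_k}{n} \cdot \mu, \quad M_k = \frac{m_k}{n} + \frac{n - n_k}{n} \cdot \mu^2 \tag{6}$$

and, correspondingly,

$$U_k = E_k + k_0 \cdot \sqrt{M_k - (E_k)^2} \quad \text{or} \quad L_k = E_k - k_0 \cdot \sqrt{M_k - (E_k)^2}. \tag{7}$$

— Finally, if we are computing $\underline{U}$, we return the smallest of the values $U_k$; if we are computing $\overline{L}$, we return the largest of the values $L_k$.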
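
The steps for $\underline{U}$ can be sketched in code as follows. This is a minimal illustration and not the authors' implementation. It assumes that the sums $e_k$, $m_k$ run over the intervals lying entirely to the right of the zone (taken at their lower endpoints) and entirely to the left of the zone (taken at their upper endpoints). The function name, the tolerance `tol` and the fallback for a degenerate quadratic equation are assumptions added for this example.

```python
import math


def lower_bound_of_U(intervals, k0, tol=1e-9):
    """Zone-based search for the lower endpoint of U = E + k0*sigma (illustrative sketch)."""
    n = len(intervals)
    a2 = (1.0 / k0) ** 2  # alpha^2
    pts = sorted([lo for lo, _ in intervals] + [hi for _, hi in intervals])
    borders = [-math.inf] + pts + [math.inf]
    best = math.inf
    for k in range(2 * n + 1):
        z_lo, z_hi = borders[k], borders[k + 1]
        e = m = 0.0
        nk = 0
        for lo, hi in intervals:
            if lo >= z_hi:      # interval to the right of the zone: x_i = lower endpoint
                v = lo
            elif hi <= z_lo:    # interval to the left of the zone: x_i = upper endpoint
                v = hi
            else:               # interval contains the zone: x_i = mu
                continue
            e += v
            m += v * v
            nk += 1
        A = e * e * (1 + a2) - a2 * m * n
        B = 2 * e * ((1 + a2) * nk - a2 * n)
        C = nk * ((1 + a2) * nk - a2 * n)
        if abs(C) > tol:
            disc = B * B - 4 * A * C
            if disc < -tol:
                continue
            r = math.sqrt(max(disc, 0.0))
            roots = [(B - r) / (2 * C), (B + r) / (2 * C)]
        else:
            # Degenerate equation (assumption): try the finite zone endpoints instead.
            roots = [z for z in (z_lo, z_hi) if math.isfinite(z)]
        for mu in roots:
            if mu * nk <= e + tol and z_lo - tol <= mu <= z_hi + tol:
                E = e / n + (n - nk) / n * mu
                M = m / n + (n - nk) / n * mu * mu
                best = min(best, E + k0 * math.sqrt(max(M - E * E, 0.0)))
    return best
```

Every candidate evaluated in the inner loop corresponds to an admissible choice of $x_1, \ldots, x_n$, so the returned value is never below the true minimum. For the intervals $[0,1]$ and $[2,3]$ with $k_0 = 2$ one has $U = 1.5\,x_2 - 0.5\,x_1$, whose minimum $2.5$ is attained at $x_1 = 1$, $x_2 = 2$.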

Theorem 1. The algorithms $A_{\underline{U}}$ and $A_{\overline{L}}$ always compute $\underline{U}$ and $\overline{L}$ in quadratic time.



Comment. The main idea of this proof is given in the last (Proofs) section. The 
detailed proofs are given in http://www.cs.utep.edu/vladik/2003/tr03-10c.ps.gz 
and in http://www.cs.utep.edu/vladik/2003/tr03-10c.pdf 



4 In General, Detecting Guaranteed Outliers Is NP-Hard 

As we have mentioned in Section 1, to be able to detect guaranteed outliers, we must be able to compute the values $\underline{L}$ and $\overline{U}$. In general, this is an NP-hard problem:

Theorem 2. For every $k_0 > 1$, computing the upper endpoint $\overline{U}$ of the interval $[\underline{U}, \overline{U}]$ of possible values of $U = E + k_0 \cdot \sigma$ is NP-hard.

Theorem 3. For every $k_0 > 1$, computing the lower endpoint $\underline{L}$ of the interval $[\underline{L}, \overline{L}]$ of possible values of $L = E - k_0 \cdot \sigma$ is NP-hard.

Comment. For interval data, the NP-hardness of computing the upper bound for $\sigma$ was proven in [3] and [4]. The general overview of NP-hardness of computational problems in interval context is given in [8].



5 How Can We Actually Detect Guaranteed Outliers? 



How can we actually compute these values? First, we will show that if $1 + (1/k_0)^2 < n$ (which is true, e.g., if $k_0 > 1$ and $n \ge 2$), then the maximum of $U$ (correspondingly, the minimum of $L$) is always attained at some combination of endpoints of the intervals $\mathbf{x}_i$; thus, in principle, to determine the values $\overline{U}$ and $\underline{L}$, it is sufficient to try all $2^n$ combinations of values $\underline{x}_i$ and $\overline{x}_i$.

Theorem 4. If $1 + (1/k_0)^2 < n$, then the maximum of the function $U$ and the minimum of the function $L$ on the box $\mathbf{x}_1 \times \ldots \times \mathbf{x}_n$ are attained at its vertices, i.e., when for every $i$, either $x_i = \underline{x}_i$ or $x_i = \overline{x}_i$.
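
The exhaustive search over the $2^n$ vertices suggested by Theorem 4 can be sketched as follows. This is an illustration under the theorem's assumption $1 + (1/k_0)^2 < n$; the function name is hypothetical, and the running time is exponential in $n$.

```python
import itertools
import math


def vertex_bounds(intervals, k0):
    """Return (L_min, U_max) by trying every combination of interval endpoints."""
    n = len(intervals)
    U_max, L_min = -math.inf, math.inf
    for xs in itertools.product(*intervals):  # each x_i is one of its two endpoints
        E = sum(xs) / n
        sigma = math.sqrt(sum((x - E) ** 2 for x in xs) / n)
        U_max = max(U_max, E + k0 * sigma)
        L_min = min(L_min, E - k0 * sigma)
    return L_min, U_max
```

For the intervals $[0,1]$ and $[2,3]$ with $k_0 = 2$, both extreme values are attained at the vertex $(0, 3)$, where $E = 1.5$ and $\sigma = 1.5$.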



NP-hard means, crudely speaking, that there are no general ways for solving all particular cases of this problem (i.e., computing $\overline{U}$) in reasonable time.

However, we show that there are algorithms for computing $\overline{U}$ and $\underline{L}$ for many reasonable situations. Namely, we propose efficient algorithms that compute $\overline{U}$ and $\underline{L}$ for the case when all the interval midpoints (“measured values”) $\tilde{x}_i \stackrel{\text{def}}{=} (\underline{x}_i + \overline{x}_i)/2$ are definitely different from each other, in the sense that the “narrowed” intervals

$$\left[\tilde{x}_i - \frac{1 + \alpha^2}{n} \cdot \Delta_i,\ \tilde{x}_i + \frac{1 + \alpha^2}{n} \cdot \Delta_i\right] \tag{8}$$

- where $\alpha = 1/k_0$ and $\Delta_i \stackrel{\text{def}}{=} (\overline{x}_i - \underline{x}_i)/2$ is the interval’s half-width - do not intersect with each other.

The algorithms $A_{\overline{U}}$ and $A_{\underline{L}}$ are as follows:

— In both algorithms, first, we sort all $2n$ endpoints of the narrowed intervals $\tilde{x}_i - \frac{1 + \alpha^2}{n} \cdot \Delta_i$ and $\tilde{x}_i + \frac{1 + \alpha^2}{n} \cdot \Delta_i$ into a sequence $x_{(1)} \le x_{(2)} \le \ldots \le x_{(2n)}$. This enables us to divide the real line into $2n + 1$ segments (“small intervals”) $[x_{(i)}, x_{(i+1)}]$, where we denoted $x_{(0)} \stackrel{\text{def}}{=} -\infty$ and $x_{(2n+1)} \stackrel{\text{def}}{=} +\infty$.

— For each of small intervals $[x_{(i)}, x_{(i+1)}]$, we do the following: for each $j$ from 1 to $n$, we pick the following value of $x_j$:

• if $x_{(i)} < \tilde{x}_j - \frac{1 + \alpha^2}{n} \cdot \Delta_j$, then we pick $x_j = \overline{x}_j$;

• if $x_{(i+1)} > \tilde{x}_j + \frac{1 + \alpha^2}{n} \cdot \Delta_j$, then we pick $x_j = \underline{x}_j$;

• for all other $j$, we consider both possible values $x_j = \overline{x}_j$ and $x_j = \underline{x}_j$.

As a result, we get one or several sequences of $x_j$ for each small interval.

— To compute $\overline{U}$, for each of the sequences $x_j$, we check whether, for the selected values $x_1, \ldots, x_n$, the value of $E - \alpha \cdot \sigma$ is indeed within the corresponding small interval, and if it is, compute the value $U = E + k_0 \cdot \sigma$. Finally, we return the largest of the computed values $U$ as $\overline{U}$.

— To compute $\underline{L}$, for each of the sequences $x_j$, we check whether, for the selected values $x_1, \ldots, x_n$, the value of $E + \alpha \cdot \sigma$ is indeed within the corresponding small interval, and if it is, compute the value $L = E - k_0 \cdot \sigma$. Finally, we return the smallest of the computed values $L$ as $\underline{L}$.
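
The procedure above can be sketched as follows. This is an illustration only, not the authors' code. It assumes the reading of the selection rules in which an interval lying to the right of the small interval is taken at its upper endpoint and an interval lying to the left is taken at its lower endpoint. The function name and the tolerance `tol` are hypothetical.

```python
import itertools
import math


def narrowed_interval_bounds(intervals, k0, tol=1e-12):
    """Return (L_min, U_max) using the 'narrowed' intervals (illustrative sketch)."""
    n = len(intervals)
    alpha = 1.0 / k0
    shrink = (1 + alpha ** 2) / n
    narrowed = []
    for lo, hi in intervals:
        mid, half = (lo + hi) / 2, (hi - lo) / 2
        narrowed.append((mid - shrink * half, mid + shrink * half))
    pts = sorted(p for pair in narrowed for p in pair)
    borders = [-math.inf] + pts + [math.inf]
    U_max, L_min = -math.inf, math.inf
    for z_lo, z_hi in zip(borders, borders[1:]):
        options = []
        for (lo, hi), (n_lo, n_hi) in zip(intervals, narrowed):
            if z_lo < n_lo:      # small interval lies to the left of narrowed interval j
                options.append((hi,))
            elif z_hi > n_hi:    # small interval lies to the right of narrowed interval j
                options.append((lo,))
            else:                # small interval is inside narrowed interval j: try both
                options.append((lo, hi))
        for xs in itertools.product(*options):
            E = sum(xs) / n
            sigma = math.sqrt(sum((x - E) ** 2 for x in xs) / n)
            if z_lo - tol <= E - alpha * sigma <= z_hi + tol:
                U_max = max(U_max, E + k0 * sigma)
            if z_lo - tol <= E + alpha * sigma <= z_hi + tol:
                L_min = min(L_min, E - k0 * sigma)
    return L_min, U_max
```

When the narrowed intervals are disjoint, at most one index has two options per small interval, so each small interval contributes at most two candidate sequences. On the intervals $[0,1]$ and $[2,3]$ with $k_0 = 2$ the sketch agrees with the exhaustive vertex search.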

Theorem 5. Let $1/n + 1/k_0^2 < 1$. The algorithms $A_{\overline{U}}$ and $A_{\underline{L}}$ compute $\overline{U}$ and $\underline{L}$ in quadratic time for all the cases in which the “narrowed” intervals do not intersect with each other.

These algorithms also work when, for some fixed $C$, no more than $C$ “narrowed” intervals can have a common point:

Theorem 6. Let $1 + (1/k_0)^2 < n$. For every positive integer $C$, the algorithms $A_{\overline{U}}$ and $A_{\underline{L}}$ compute $\overline{U}$ and $\underline{L}$ in quadratic time for all the cases in which no more than $C$ “narrowed” intervals can have a common point.

The corresponding computation times are quadratic in n but grow expo- 
nentially with C . So, when C grows, this algorithm requires more and more 
computation time. It is worth mentioning that the examples on which we prove 
NP-hardness correspond to the case when n/2 out of n narrowed intervals have 
a common point. 



6 Proofs: Main Idea 



Our proof of Theorem 1 is based on the fact that when the function $U(x_1, \ldots, x_n)$ attains its smallest possible value at some point $(x_1^{\rm opt}, \ldots, x_n^{\rm opt})$, then, for every $i$, the corresponding function of one variable

$$U_i(x_i) \stackrel{\text{def}}{=} U(x_1^{\rm opt}, \ldots, x_{i-1}^{\rm opt}, x_i, x_{i+1}^{\rm opt}, \ldots, x_n^{\rm opt}) \tag{9}$$

- the function that is obtained from $U(x_1, \ldots, x_n)$ by fixing the values of all the variables except for $x_i$ - also attains its minimum at the value $x_i = x_i^{\rm opt}$.

A differentiable function of one variable attains its minimum on a closed interval either at one of its endpoints or at an internal point in which its first derivative is equal to 0.

This first derivative is equal to 0 when $\sigma + k_0 \cdot (x_i - E) = 0$, i.e., when $x_i = E - \alpha \cdot \sigma$, where $\alpha = 1/k_0$. Thus, for the optimal values $x_1, \ldots, x_n$ for which $U$ attains its minimum, for every $i$, we have either $x_i = \underline{x}_i$, or $x_i = \overline{x}_i$, or $x_i = E - \alpha \cdot \sigma$.
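
The stationarity condition can be made explicit. The following computation is only a sketch: it uses the definitions $E = \frac{1}{n}\sum_j x_j$, $V = \frac{1}{n}\sum_j (x_j - E)^2$, $\sigma = \sqrt{V}$, and assumes $\sigma > 0$.

$$\frac{\partial E}{\partial x_i} = \frac{1}{n}, \qquad \frac{\partial V}{\partial x_i} = \frac{2 (x_i - E)}{n}, \qquad \frac{\partial \sigma}{\partial x_i} = \frac{x_i - E}{n \sigma},$$

$$\frac{\partial U}{\partial x_i} = \frac{1}{n}\left(1 + k_0 \cdot \frac{x_i - E}{\sigma}\right) = 0 \iff x_i = E - \frac{\sigma}{k_0} = E - \alpha \cdot \sigma.$$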

We then show that if the open interval $(\underline{x}_i, \overline{x}_i)$ contains the value $E - \alpha \cdot \sigma$, then the minimum of the function $U_i$ cannot be attained at points $\underline{x}_i$ or $\overline{x}_i$ and therefore, has to be attained at the value $x_i = E - \alpha \cdot \sigma$.

We also show that:

— when $E - \alpha \cdot \sigma \le \underline{x}_i$, the minimum cannot be attained for $x_i = \overline{x}_i$ and therefore, it is attained when $x_i = \underline{x}_i$;

— when $\overline{x}_i \le E - \alpha \cdot \sigma$, the minimum cannot be attained for $x_i = \underline{x}_i$ and therefore, it is attained when $x_i = \overline{x}_i$.

Due to what we have proven, once we know how the value $\mu \stackrel{\text{def}}{=} E - \alpha \cdot \sigma$ is located with respect to all the intervals $[\underline{x}_i, \overline{x}_i]$, we can find the optimal values of $x_i$. Hence, to find the minimum, we need to analyze how the endpoints $\underline{x}_i$ and $\overline{x}_i$ divide the real line, and consider all the resulting sub-intervals.

Conclusions 

In many application areas, it is important to detect outliers. The traditional engineering approach to outlier detection is that we start with some “normal” values $x_1, \ldots, x_n$, compute the sample average $E$, the sample standard deviation $\sigma$, and then mark a value $x$ as an outlier if $x$ is outside the $k_0$-sigma interval $[E - k_0 \cdot \sigma, E + k_0 \cdot \sigma]$ (for some pre-selected parameter $k_0$).

In real life, we often have only interval ranges $\mathbf{x}_i = [\underline{x}_i, \overline{x}_i]$ for the normal values $x_1, \ldots, x_n$. For different values $x_i \in \mathbf{x}_i$, we get different values of $L \stackrel{\text{def}}{=} E - k_0 \cdot \sigma$ and $U \stackrel{\text{def}}{=} E + k_0 \cdot \sigma$, and thus, different $k_0$-sigma intervals $[L, U]$. We can therefore identify guaranteed outliers as values that are outside all $k_0$-sigma intervals, and possible outliers as values that are outside some $k_0$-sigma intervals. To detect guaranteed and possible outliers, we must therefore be able to compute the range $\mathbf{L} = [\underline{L}, \overline{L}]$ of possible values of $L$ and the range $\mathbf{U} = [\underline{U}, \overline{U}]$ of possible values of $U$.

In our previous papers [3,4], we have shown how to compute the intervals $\mathbf{E} = [\underline{E}, \overline{E}]$ and $[\underline{\sigma}, \overline{\sigma}]$ of possible values for $E$ and $\sigma$. In principle, we can combine these intervals and conclude, e.g., that $L$ always belongs to the interval $\mathbf{E} - k_0 \cdot [\underline{\sigma}, \overline{\sigma}]$. However, the resulting interval for $L$ is wider than the actual range, because the values $E$ and $\sigma$ are computed based on the same inputs $x_1, \ldots, x_n$ and are, therefore, not independent from each other.

If, instead of the actual ranges for L and U, we use wider intervals, we may 
miss some outliers. It is therefore important to compute the exact ranges for L 
and U. 

In this paper, we showed that computing these ranges is, in general, NP-hard, 
and we provided efficient algorithms that compute these ranges under reasonable 
conditions. 
Abstract. The present work is devoted to computation with zonotopes
in the plane. Using ideas from the theory of quasivector spaces we formu- 
late an approximation problem for zonotopes and propose an algorithm 
for its solution. 



1 Introduction 

Zonotopes are convex bodies with simple algebraic presentation: they are Min- 
kowski sums of segments. By means of a translation (addition of a vector) any 
zonotope can be centered at the origin, therefore without loss of generality we 
can restrict our considerations to centered (origin symmetric) zonotopes. The 
latter are positive combinations of centered unit segments [7, 8] . A fixed sys- 
tem of centered unit segments generates a class of zonotopes consisting of all 
positive combinations of the unit segments. This class is closed under addition 
and multiplication by scalar and forms a quasilinear space [2, 5, 6]. A quasilinear 
space over the field of reals is an additive abelian monoid with cancellation law 
endowed with multiplication by scalars. Any quasilinear space can be embedded 
in a group; in addition a natural isomorphic extension of the multiplication by 
scalars leads to quasilinear spaces with group structure called quasivector spaces 
[4]. Quasivector spaces obey all axioms of vector spaces, but in place of the second distributive law we have: $(\alpha + \beta) * c = \alpha * c + \beta * c$, if $\alpha\beta \ge 0$. If, in addition, the elements satisfy the relation $(-1) * c = c$, then the space is called a symmetric quasilinear space.

Every quasivector space is a direct sum of a vector space and a symmetric quasivector space [3,4]. On the other hand, the algebraic operations in vector and symmetric quasivector spaces are mutually representable. This enables us to transfer basic vector space concepts (such as linear combination, basis, dimension, etc.) to symmetric quasivector spaces. Let us also mention that symmetric quasivector spaces with finite basis are isomorphic to a canonic space similar to $\mathbb{R}^n$ [4]. These results can be used for computations with zonotopes, as then we practically work in a vector space. Computing with centered zonotopes is especially simple and instructive [1]. In the present work we demonstrate some properties of centered zonotopes in the plane. In particular, an approximation problem related to zonotopes is formulated and solved by means of a numerical procedure and a MATLAB program. Our procedure allows us to represent approximately any centered zonotope in the plane by means of a class of zonotopes over a given basis of centered segments.

2 Quasivector Spaces 

By $\mathbb{R}$ we denote the set of reals; we use the same notation for the linearly ordered field of reals $\mathbb{R} = (\mathbb{R}, +, \cdot, \le)$. For any integer $n \ge 1$ denote by $\mathbb{R}^n$ the set of all $n$-tuples $(\alpha_1, \alpha_2, \ldots, \alpha_n)$, where $\alpha_i \in \mathbb{R}$. The set $\mathbb{R}^n$ forms a vector space $V^n = (\mathbb{R}^n, +, \mathbb{R}, \cdot)$ under addition and multiplication by scalars.

Every abelian monoid $(\mathcal{M}, +)$ with cancellation law induces an abelian group $(\mathcal{D}(\mathcal{M}), +)$, where $\mathcal{D}(\mathcal{M}) = \mathcal{M}^2/\!\sim$ consists of all pairs $(A, B)$ factorized by the congruence relation $\sim$: $(A, B) \sim (C, D)$ iff $A + D = B + C$, for all $A, B, C, D \in \mathcal{M}$. Addition in $\mathcal{D}(\mathcal{M})$ is $(A, B) + (C, D) = (A + C, B + D)$. The null element of $\mathcal{D}(\mathcal{M})$ is the class $(Z, Z)$, $Z \in \mathcal{M}$; we have $(Z, Z) \sim (0, 0)$. The opposite element to $(A, B) \in \mathcal{D}(\mathcal{M})$ is $\mathrm{opp}(A, B) = (B, A)$. All elements of $\mathcal{D}(\mathcal{M})$ admitting the form $(A, 0)$ are called proper and the remaining are improper. The opposite of a proper element is improper; $\mathrm{opp}(A, 0) = (0, A)$.
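
The construction of the group of pairs can be illustrated on the simplest cancellative monoid, the non-negative integers under addition. The class below is a hypothetical sketch written for this illustration; it is not part of the paper.

```python
class Pair:
    """Element (A, B) of D(M) for the monoid M of non-negative integers under addition."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def __eq__(self, other):
        # (A, B) ~ (C, D) iff A + D = B + C
        return self.a + other.b == self.b + other.a

    def __add__(self, other):
        # (A, B) + (C, D) = (A + C, B + D)
        return Pair(self.a + other.a, self.b + other.b)

    def opp(self):
        # opp(A, B) = (B, A)
        return Pair(self.b, self.a)

    def __repr__(self):
        return f"Pair({self.a}, {self.b})"
```

For instance, `Pair(3, 1)` and `Pair(5, 3)` represent the same class, and adding `Pair(3, 1).opp()` to `Pair(3, 1)` gives a representative of the null element.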

Definition 1. Let $(\mathcal{M}, +)$ be an abelian monoid with cancellation law. Assume that a mapping “$*$” (multiplication by scalars) is defined on $\mathbb{R} \times \mathcal{M}$ satisfying: i) $\gamma * (A + B) = \gamma * A + \gamma * B$, ii) $\alpha * (\beta * C) = (\alpha\beta) * C$, iii) $1 * A = A$, iv) $(\alpha + \beta) * C = \alpha * C + \beta * C$, if $\alpha\beta \ge 0$. The algebraic system $(\mathcal{M}, +, \mathbb{R}, *)$ is called a quasilinear space over $\mathbb{R}$.

Every quasilinear space $(\mathcal{M}, +, \mathbb{R}, *)$ can be embedded into a group $(\mathcal{D}(\mathcal{M}), +)$. Multiplication by scalars “$*$” is naturally extended from $\mathbb{R} \times \mathcal{M}$ to $\mathbb{R} \times \mathcal{D}(\mathcal{M})$ by means of:

$$\gamma * (A, B) = (\gamma * A, \gamma * B), \quad A, B \in \mathcal{M}, \ \gamma \in \mathbb{R}. \tag{1}$$

In the sequel we shall call quasilinear spaces of group structure, such as $\mathcal{D}(\mathcal{M})$, quasivector spaces, and denote their elements by lower case roman letters, e.g. $a = (A_1, A_2)$, $A_1, A_2 \in \mathcal{M}$.



Definition 2. [4] A quasivector space (over $\mathbb{R}$), denoted $(Q, +, \mathbb{R}, *)$, is an abelian group $(Q, +)$ with a mapping (multiplication by scalars) “$*$”: $\mathbb{R} \times Q \to Q$, such that for $a, b, c \in Q$, $\alpha, \beta, \gamma \in \mathbb{R}$: $\gamma * (a + b) = \gamma * a + \gamma * b$, $\alpha * (\beta * c) = (\alpha\beta) * c$, $1 * a = a$, $(\alpha + \beta) * c = \alpha * c + \beta * c$, if $\alpha\beta \ge 0$.

Proposition 1. [3] Let $(\mathcal{M}, +, \mathbb{R}, *)$ be a quasilinear space over $\mathbb{R}$, and $(Q, +)$, $Q = \mathcal{D}(\mathcal{M})$, be the induced abelian group. Let $* : \mathbb{R} \times Q \to Q$ be multiplication by scalars defined by (1). Then $(Q, +, \mathbb{R}, *)$ is a quasivector space over $\mathbb{R}$.

Let $a$ be an element of a quasivector space $(Q, +, \mathbb{R}, *)$, $a \in Q$. The operator $\neg a = (-1) * a$ is called negation; in the literature it is usually denoted $-a = (-1) * a$. We write $a \neg b = a + (\neg b)$; note that $a \neg a = 0$ may not generally hold. From $\mathrm{opp}(a) + a = 0$ we obtain $\neg\mathrm{opp}(a) \neg a = 0$, that is $\neg\mathrm{opp}(a) = \mathrm{opp}(\neg a)$. We shall use the notation $a_- = \neg\mathrm{opp}(a) = \mathrm{opp}(\neg a)$; the latter operator is called dualization or conjugation. The relations $\neg\mathrm{opp}(a) = \mathrm{opp}(\neg a) = a_-$ imply $\mathrm{opp}(a) = \neg(a_-) = (\neg a)_-$, shortly $\mathrm{opp}(a) = \neg a_-$. Thus, the symbolic notation $\neg a_-$ can be used instead of $\mathrm{opp}(a)$, and, for $a \in Q$ we can write $a \neg a_- = 0$, resp. $\neg a_- + a = 0$. We also note that some vector space concepts, such as subspace, sum and direct sum are trivially extended to quasivector spaces [4]. Some rules for calculation in quasivector spaces are summarized in [4].

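Example 1. For any integer $k \ge 1$ the set $\mathbb{R}^k$ of all $k$-tuples $(\alpha_1, \alpha_2, \ldots, \alpha_k)$, $\alpha_i \in \mathbb{R}$, with $(\alpha_1, \alpha_2, \ldots, \alpha_k) = (\beta_1, \beta_2, \ldots, \beta_k)$ whenever $\alpha_1 = \beta_1, \alpha_2 = \beta_2, \ldots, \alpha_k = \beta_k$, forms a quasivector space over $\mathbb{R}$ under the operations

$$(\alpha_1, \alpha_2, \ldots, \alpha_k) + (\beta_1, \beta_2, \ldots, \beta_k) = (\alpha_1 + \beta_1, \alpha_2 + \beta_2, \ldots, \alpha_k + \beta_k), \tag{2}$$

$$\gamma * (\alpha_1, \alpha_2, \ldots, \alpha_k) = (|\gamma|\alpha_1, |\gamma|\alpha_2, \ldots, |\gamma|\alpha_k), \quad \gamma \in \mathbb{R}. \tag{3}$$

This quasivector space is denoted by $S^k = (\mathbb{R}^k, +, \mathbb{R}, *)$. Negation in $S^k$ is the same as identity while the opposite operator is the same as conjugation:

$$\mathrm{opp}(\alpha_1, \alpha_2, \ldots, \alpha_k) = (\alpha_1, \alpha_2, \ldots, \alpha_k)_- = (-\alpha_1, -\alpha_2, \ldots, -\alpha_k). \tag{4}$$

The direct sum $V^l \oplus S^k$ of the $l$-dimensional vector space $V^l = (\mathbb{R}^l, +, \mathbb{R}, \cdot)$ and the quasivector space $S^k = (\mathbb{R}^k, +, \mathbb{R}, *)$ is a quasivector space.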
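
The operations (2)-(4) are easy to experiment with. The sketch below is only an illustration with hypothetical function names; it represents elements of the space as Python tuples and shows that the second distributive law holds for scalars of equal sign and may fail for scalars of opposite signs.

```python
def add(a, b):
    """Addition (2): componentwise sum."""
    return tuple(x + y for x, y in zip(a, b))


def star(gamma, a):
    """Multiplication by scalars (3): gamma * a = (|gamma| a_1, ..., |gamma| a_k)."""
    return tuple(abs(gamma) * x for x in a)


def conj(a):
    """Conjugation (4), which coincides with the opposite in this space."""
    return tuple(-x for x in a)
```

For `c = (1, -2)` we get `star(-1, c) == c`, so negation is the identity, and `add(c, conj(c))` is the null element. Furthermore `star(2 + 3, c)` equals `add(star(2, c), star(3, c))`, whereas `star(2 - 3, c)` differs from `add(star(2, c), star(-3, c))`.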

Example 2. The system $(\mathcal{K}, +)$ of all convex bodies [7] in a real $m$-dimensional Euclidean vector space $\mathbb{E}^m$ with addition: $A + B = \{a + b \mid a \in A, b \in B\}$, $A, B \in \mathcal{K}$, is an abelian monoid with cancellation law having as a neutral element the origin “0” of $\mathbb{E}^m$. The system $(\mathcal{K}, +, \mathbb{R}, *)$, where “$*$” is multiplication by real scalars defined by: $\gamma * A = \{\gamma a \mid a \in A\}$, is a quasilinear space (of monoid structure), that is the following four relations are satisfied: i) $\gamma * (A + B) = \gamma * A + \gamma * B$, ii) $\alpha * (\beta * C) = (\alpha\beta) * C$, iii) $1 * A = A$, iv) $(\alpha + \beta) * C = \alpha * C + \beta * C$, if $\alpha\beta \ge 0$ [4]. The monoid $(\mathcal{K}, +)$ induces a group of generalized convex bodies $(\mathcal{D}(\mathcal{K}), +)$. According to Proposition 1, the space $(\mathcal{D}(\mathcal{K}), +, \mathbb{R}, *)$, where “$*$” is defined by (1), is a quasivector space [3]. A centrally symmetric convex body with center at the origin will be called centered convex body, cf. [7], p. 383. Centered convex bodies do not change under multiplication by $-1$.



Definition 3. Let $Q$ be a quasivector space. An element $a \in Q$ with $a + \neg a = 0$ is called linear. An element $a \in Q$ with $a = \neg a$ is called origin symmetric.

Proposition 2. Assume that $Q$ is a quasivector space. The subsets of linear and symmetric elements $Q' = \{a \in Q \mid a + \neg a = 0\}$, resp. $Q'' = \{a \in Q \mid a = \neg a\}$, form subspaces of $Q$. The subspace $Q'$ is a vector space called the vector (linear) subspace of $Q$.

The space $Q'' = \{a \in Q \mid a = \neg a\}$ of centered elements is called the symmetric centered subspace of $Q$. For symmetric elements $b \in Q''$ the following relations are equivalent: $b = \neg b \iff b + b_- = 0 \iff b_- = \mathrm{opp}(b)$. The following theorem shows the important role of symmetric quasivector spaces.



Theorem 1. [4] For every quasivector space $Q$ we have $Q = Q' \oplus Q''$. More specifically, for every $x \in Q$ we have $x = x' + x'' = (x'; x'')$ with unique $x' = (1/2) * (x + x_-) \in Q'$, and $x'' = (1/2) * (x + \neg x) \in Q''$.



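Symmetric Quasivector Spaces. Let $(Q, +, \mathbb{R}, *)$ be a symmetric quasivector space over $\mathbb{R}$. Define the operation "$\cdot$": $\mathbb{R} \times Q \to Q$ by

$\alpha \cdot c = \alpha * c_{\sigma(\alpha)} = \begin{cases} \alpha * c, & \text{if } \alpha \ge 0, \\ \alpha * c_-, & \text{if } \alpha < 0, \end{cases}$ (5)

where $\sigma(\gamma) = \{+, \text{ if } \gamma \ge 0; \; -, \text{ if } \gamma < 0\}$ and $c_+ = c$.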



Theorem 2. [3], [4] Let $(Q, +, \mathbb{R}, *)$ be a symmetric quasivector space over $\mathbb{R}$. Then $(Q, +, \mathbb{R}, \cdot)$, with "$\cdot$" defined by (5), is a vector space over $\mathbb{R}$.

Note that for $a$ centered, the element $(-1) \cdot a = (-1) * a_- = a_-$ is the opposite to $a$, that is $a + (-1) \cdot a = 0$, resp. $a + a_- = 0$.

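Linear Combinations. Assume that $(S, +, \mathbb{R}, *)$ is a symmetric quasivector space and $(S, +, \mathbb{R}, \cdot)$ is the associated vector space from Theorem 2. We may transfer vector space concepts from $(S, +, \mathbb{R}, \cdot)$, such as linear combination, linear dependence, basis etc., to the original symmetric quasivector space $(S, +, \mathbb{R}, *)$. For example, let $c^{(1)}, c^{(2)}, \ldots, c^{(k)}$ be finitely many elements of $S$. The familiar linear combination

$f = \sum_{i=1}^{k} \alpha_i \cdot c^{(i)} = \alpha_1 \cdot c^{(1)} + \alpha_2 \cdot c^{(2)} + \ldots + \alpha_k \cdot c^{(k)}$, $\alpha_1, \alpha_2, \ldots, \alpha_k \in \mathbb{R}$,

in the induced vector space $(S, +, \mathbb{R}, \cdot)$, can be rewritten using (5) as

$f = \sum_{i=1}^{k} \alpha_i * c^{(i)}_{\sigma(\alpha_i)} = \alpha_1 * c^{(1)}_{\sigma(\alpha_1)} + \alpha_2 * c^{(2)}_{\sigma(\alpha_2)} + \ldots + \alpha_k * c^{(k)}_{\sigma(\alpha_k)}$. (6)

Thus (6) is a linear combination of $c^{(1)}, c^{(2)}, \ldots, c^{(k)} \in S$ in the symmetric quasivector space $(S, +, \mathbb{R}, *)$. Similarly, the concepts of spanned subspace, linear (in)dependency, linear mapping, basis, dimension, etc. are defined, and the theory of vector spaces can be reformulated in $(S, +, \mathbb{R}, *)$ [3,4].

Theorem 3. [4] Any symmetric quasivector space over $\mathbb{R}$, with a basis of $k$ elements, is isomorphic to $(\mathbb{R}^k, +, \mathbb{R}, *)$.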






Svetoslav Markov and Dalcidio Claudio 



3 Computation with Centered Zonotopes in the Plane 

Using the theory of symmetric quasivector spaces that we briefly outlined in 
the previous section, we consider next a class of special convex bodies, namely 
centered zonotopes in the plane. 

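Every unit vector in the plane $e = e(\varphi) = (\cos\varphi, \sin\varphi) \in \mathbb{R}^2$, $\varphi \in [0, \pi)$, defines a centered segment $\bar{e}$ with endpoints $-e$ and $e$:

$\bar{e} = \mathrm{conv}\{-e, e\} = \{\lambda e \mid \lambda \in [-1, 1]\}$.

For $\rho \in \mathbb{R}$ denote $s = \rho e$. Multiplication of a unit centered segment $\bar{e}$ by a scalar $\rho \in \mathbb{R}$ is:

$\bar{s} = \rho * \bar{e} = \overline{\rho e} = \mathrm{conv}\{-s, s\} = \{\lambda\rho e \mid \lambda \in [-1, 1]\}$.

More generally, multiplication of a centered segment $\overline{\rho e}$ by a scalar $\gamma \in \mathbb{R}$ gives $\gamma * \overline{\rho e} = \overline{(\gamma\rho) e}$. Note that $(-1) * \bar{s} = \bar{s}$; more generally, $(-\rho) * \bar{s} = \rho * \bar{s}$ (whereas, for comparison, we have, of course, $(-\rho)s \neq \rho s$).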

Minkowski addition of colinear centered segments is $\overline{\rho_1 e} + \overline{\rho_2 e} = \overline{(\rho_1 + \rho_2) e}$. To present Minkowski addition of noncolinear centered segments, assume $0 \le \varphi_1 < \varphi_2 < \pi$ and denote $e^{(1)} = (\cos\varphi_1, \sin\varphi_1)$, $e^{(2)} = (\cos\varphi_2, \sin\varphi_2)$. The points

$s^{(1)} = \rho_1 e^{(1)} = (\rho_1\cos\varphi_1, \rho_1\sin\varphi_1)$,
$s^{(2)} = \rho_2 e^{(2)} = (\rho_2\cos\varphi_2, \rho_2\sin\varphi_2)$,

$\rho_1, \rho_2 \in \mathbb{R}$, define two noncolinear centered segments $\bar{s}^{(1)}$, $\bar{s}^{(2)}$. The Minkowski sum $\bar{s}^{(1)} + \bar{s}^{(2)}$ is a centered quadrangle (parallelogram) $P$ with vertices $\pm s^{(1)} \pm s^{(2)}$. The perimeter of $P$ is $4(\rho_1 + \rho_2)$ and the area of $P$ is $4\rho_1\rho_2\sin(\varphi_2 - \varphi_1)$.

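Assume that we are given $k$ fixed centered unit vectors $e^{(1)}, e^{(2)}, \ldots, e^{(k)}$ in cyclic anticlockwise order, such that $0 \le \varphi_1 < \varphi_2 < \cdots < \varphi_k < \pi$. A system of centered unit vectors $e^{(i)} = e(\varphi_i)$, $i = 1, \ldots, k$, satisfying $0 \le \varphi_1 < \ldots < \varphi_k < \pi$ will be further called regular; the same notion will be used for the induced system of centered unit segments $\{\bar{e}^{(i)}\}$. In particular, the system $\{e(\varphi_i)\}$ with $\varphi_i = \pi(i - 1)/k$, $i = 1, \ldots, k$, is regular; for this system we have $\varphi_{i+1} - \varphi_i = \pi/k = \mathrm{const}$.

For $\alpha_i \ge 0$ the vectors $s^{(i)} = \alpha_i e^{(i)} = (\alpha_i\cos\varphi_i, \alpha_i\sin\varphi_i)$ induce the centered segments $\bar{s}^{(i)} = \alpha_i * \bar{e}^{(i)} = \overline{\alpha_i e^{(i)}}$, $i = 1, \ldots, k$. The positive combination of the segments $\bar{s}^{(i)}$

$z = \sum_{i=1}^{k} \bar{s}^{(i)} = \sum_{i=1}^{k} \alpha_i * \bar{e}^{(i)}$, $\alpha_i \ge 0$, (7)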
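is a centered zonotope with $2k$ vertices $v^{(1)}, \ldots, v^{(k)}, -v^{(1)}, \ldots, -v^{(k)}$ [7, 8], where

$v^{(1)} = \alpha_1 e^{(1)} + \alpha_2 e^{(2)} + \ldots + \alpha_{k-1} e^{(k-1)} + \alpha_k e^{(k)}$,
$v^{(2)} = -\alpha_1 e^{(1)} + \alpha_2 e^{(2)} + \ldots + \alpha_{k-1} e^{(k-1)} + \alpha_k e^{(k)}$,
$\ldots$
$v^{(i)} = -\alpha_1 e^{(1)} - \ldots - \alpha_{i-1} e^{(i-1)} + \alpha_i e^{(i)} + \ldots + \alpha_k e^{(k)}$, (8)
$\ldots$
$v^{(k)} = -\alpha_1 e^{(1)} - \alpha_2 e^{(2)} - \ldots - \alpha_{k-1} e^{(k-1)} + \alpha_k e^{(k)}$.

The vertices $v^{(1)}, \ldots, v^{(k)}$ given by (8) are lying in cyclic anticlockwise order in a half-plane between the vectors $v^{(1)}$ and $v^{(k)} = -v^{(1)} + 2\alpha_k e^{(k)}$.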
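The vertex formula (8) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `zonotope_vertices` and `polygon_area` are hypothetical, and the code assumes the angles are already sorted with $0 \le \varphi_1 < \ldots < \varphi_k < \pi$ and all coefficients nonnegative.

```python
import math

def zonotope_vertices(alphas, phis):
    """Vertices of the centered zonotope sum_i alphas[i] * [-e_i, e_i],
    where e_i = (cos phis[i], sin phis[i]) and phis is increasing in [0, pi).
    Returns the 2k vertices in anticlockwise order, following formula (8)."""
    e = [(math.cos(p), math.sin(p)) for p in phis]
    # First vertex: every generator enters with a plus sign.
    x = sum(a * ex for a, (ex, _) in zip(alphas, e))
    y = sum(a * ey for a, (_, ey) in zip(alphas, e))
    half = []
    for a, (ex, ey) in zip(alphas, e):
        half.append((x, y))
        # Flipping the sign of the next generator gives the next vertex.
        x -= 2 * a * ex
        y -= 2 * a * ey
    # The remaining k vertices follow from central symmetry.
    return half + [(-vx, -vy) for vx, vy in half]

def polygon_area(vertices):
    """Shoelace formula; the result is positive for anticlockwise order."""
    n = len(vertices)
    return 0.5 * sum(vertices[i][0] * vertices[(i + 1) % n][1]
                     - vertices[(i + 1) % n][0] * vertices[i][1]
                     for i in range(n))
```

For two generators the result is a parallelogram, so `polygon_area` can be compared against the area expression $4\rho_1\rho_2\sin(\varphi_2 - \varphi_1)$ stated for the sum of two noncolinear centered segments.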

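Two centered zonotopes $B = \sum_{i=1}^{k} \beta_i * \bar{e}^{(i)}$, $C = \sum_{i=1}^{k} \gamma_i * \bar{e}^{(i)}$ ($\beta_i \ge 0$, $\gamma_i \ge 0$) over the same system $\{\bar{e}^{(i)}\}$ are added by

$B + C = \sum_{i=1}^{k} \beta_i * \bar{e}^{(i)} + \sum_{i=1}^{k} \gamma_i * \bar{e}^{(i)} = \sum_{i=1}^{k} (\beta_i + \gamma_i) * \bar{e}^{(i)}$.

Thus, given a fixed regular system of centered unit segments $\{\bar{e}^{(i)}\}_{i=1}^{k}$, the set of all zonotopes of the form $A = \sum_{i=1}^{k} \alpha_i * \bar{e}^{(i)}$, $\alpha_i \ge 0$, is closed under Minkowski addition and multiplication by scalars and forms a quasilinear space (of monoid structure) [3,4].

Consider two zonotopes $B = \sum_{i=1}^{l} \beta_i * \bar{e}^{(i)}$, $C = \sum_{i=1}^{m} \gamma_i * \bar{p}^{(i)}$, where $\{\bar{e}^{(i)}\}_{i=1}^{l}$, $\{\bar{p}^{(i)}\}_{i=1}^{m}$ are two distinct regular systems of centered unit segments. From the expression for the sum, $B + C = \sum_{i=1}^{l} \beta_i * \bar{e}^{(i)} + \sum_{i=1}^{m} \gamma_i * \bar{p}^{(i)}$, we see that the vertices of $B + C$ can be restored using (8), where the underlying system is the union of the two sets $\{\bar{e}^{(i)}\}_{i=1}^{l}$, $\{\bar{p}^{(i)}\}_{i=1}^{m}$ (after proper ordering). Clearly the number of vertices of $B + C$ equals (generally) the sum $l + m$ of the numbers of vertices of $B$ and $C$, resp. If we want to use a fixed presentation of the zonotopes of the form (7), then we need to present (approximately) all zonotopes using one and the same system of centered unit segments.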

4 An Approximation Problem 

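The Problem. Assume that a regular system of centered unit segments $\bar{e}^{(1)}, \ldots, \bar{e}^{(k)}$, $e^{(i)} = (\cos\varphi_i, \sin\varphi_i)$, $i = 1, \ldots, k$, is given, which will be considered as basic. Assume that $\{\bar{p}^{(i)}\}_{i=1}^{m}$ is a regular system of unit centered segments, distinct from the given system $\{\bar{e}^{(i)}\}_{i=1}^{k}$. We want to approximate a given zonotope of the form $w = \sum_{i=1}^{m} \rho_i * \bar{p}^{(i)}$, $\rho_i \ge 0$, by means of zonotopes from the class $z = \sum_{i=1}^{k} \varepsilon_i * \bar{e}^{(i)}$, so that $w \subseteq z$.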

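The Algorithm. Given are unit vectors $p^{(i)} = (\cos\psi_i, \sin\psi_i)$, $i = 1, \ldots, m$, such that $0 \le \psi_1 < \ldots < \psi_m < \pi$, and nonnegative numbers $\rho_i \ge 0$, $i = 1, \ldots, m$, defining a zonotope $w = \sum_{i=1}^{m} \rho_i * \bar{p}^{(i)}$. We want to find suitable values $\{\varepsilon_i\}_{i=1}^{k}$ such that the zonotope $z = \sum_{i=1}^{k} \varepsilon_i * \bar{e}^{(i)}$ is an outer approximation of $w$, that is $w \subseteq z$.

We present the vector $p^{(i)} = (\cos\psi_i, \sin\psi_i)$ as

$p^{(i)} = \varepsilon_{i1} e^{(j)} + \varepsilon_{i2} e^{(j+1)}$, $j = j(i)$, (9)

where $e^{(j)}$, $e^{(j+1)}$ are the nearest basic unit vectors enclosing $p^{(i)}$ with $\varphi_j \le \psi_i \le \varphi_{j+1}$ and $\varepsilon_{i1}$, $\varepsilon_{i2}$ are some nonnegative coefficients. Clearly, relations (9) define the coefficients $\varepsilon_{i1}$, $\varepsilon_{i2}$ in a unique way. We note that if one of the equalities $\varphi_j = \psi_i$, $\psi_i = \varphi_{j+1}$ takes place, then one of the coefficients $\varepsilon_{i1}$, $\varepsilon_{i2}$ will be equal to zero.

Relations (9) imply the inclusions $\bar{p}^{(i)} \subseteq \varepsilon_{i1} * \bar{e}^{(j)} + \varepsilon_{i2} * \bar{e}^{(j+1)}$, $i = 1, \ldots, m$. Subsequently we obtain

$w = \sum_{i=1}^{m} \rho_i * \bar{p}^{(i)} \subseteq \sum_{i=1}^{m} \rho_i * (\varepsilon_{i1} * \bar{e}^{(j)} + \varepsilon_{i2} * \bar{e}^{(j+1)}) = \sum_{i=1}^{k} \varepsilon_i * \bar{e}^{(i)} = z$, (10)

with some $\varepsilon_i \ge 0$ that can be effectively computed.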

It follows from (10) that the zonotope $z$ is an outer approximation of the zonotope $w$. From (10) one can compute the vertices of the zonotope $z$ by means of (8). It can be shown that the above algorithm produces an optimal approximation with respect to the Hausdorff/integral metric. Roughly speaking, this follows from the fact that every single segment $\bar{p}^{(i)}$ has been optimally approximated. Note that the area of the zonotope $\varepsilon_{i1} * \bar{e}^{(j)} + \varepsilon_{i2} * \bar{e}^{(j+1)}$ is $4\varepsilon_{i1}\varepsilon_{i2}\sin(\varphi_{j+1} - \varphi_j)$. This can be used to compute the area of the zonotope $z$.

Relations (9), (10) define an algorithm leading to the construction of an outer approximation $z$ of $w$. Such an algorithm, using a uniform mesh of centered segments of the form $e^{(i)} = (\cos\varphi_i, \sin\varphi_i)$, $\varphi_i = \pi(i - 1)/k$, $i = 1, \ldots, k$, has been realized in MATLAB. It has been demonstrated that when the number $k$ of mesh points increases, the zonotope $z$ approaches the original zonotope $w$ in the Hausdorff sense.
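Relations (9) and (10) can be sketched as follows for the uniform mesh. This is a hedged Python illustration, not the MATLAB realization mentioned in the text: the names `outer_approximation` and `support` are hypothetical, indices are 0-based (so $\varphi_i = \pi i/k$), and $k \ge 2$ is assumed. Solving (9) with cross products gives $\varepsilon_{i1} = \sin(\varphi_{j+1} - \psi_i)/\sin(\varphi_{j+1} - \varphi_j)$ and $\varepsilon_{i2} = \sin(\psi_i - \varphi_j)/\sin(\varphi_{j+1} - \varphi_j)$. When $\psi_i$ lies beyond the last mesh angle, the enclosing vector at angle $\pi$ equals $-e^{(1)}$, whose centered segment coincides with $\bar{e}^{(1)}$, so its contribution wraps around to the first coefficient.

```python
import math

def outer_approximation(rhos, psis, k):
    """Coefficients eps[0..k-1] of z = sum eps[i] * seg(e_i) on the uniform
    mesh phi_i = pi*i/k (0-based), such that w = sum rhos[i] * seg(p_i) is
    contained in z. Requires k >= 2 and 0 <= psi < pi for every psi."""
    step = math.pi / k
    s = math.sin(step)                      # sin(phi_{j+1} - phi_j)
    eps = [0.0] * k
    for rho, psi in zip(rhos, psis):
        j = min(int(psi // step), k - 1)    # phi_j <= psi <= phi_{j+1}
        phi_j = j * step
        # Decomposition (9); clamp tiny negative rounding errors to zero.
        e1 = max(0.0, math.sin(phi_j + step - psi) / s)
        e2 = max(0.0, math.sin(psi - phi_j) / s)
        eps[j] += rho * e1
        # The segment at angle pi is the same as the one at angle 0.
        eps[(j + 1) % k] += rho * e2
    return eps

def support(weights, angles, theta):
    """Support function sum_i w_i * |cos(theta - angle_i)| of a zonotope."""
    return sum(w * abs(math.cos(theta - a)) for w, a in zip(weights, angles))
```

The inclusion $w \subseteq z$ can be spot-checked numerically by comparing the two support functions on a grid of directions; such a check is a sanity test, not a proof.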

The Method of Support Functions. Let us compare our method with a corresponding method based on support functions. The support function of a set $A \in \mathcal{K}$ is $h(A; u) = \max_{x \in A}(x, u)$; this function is well defined by its values on the unit circle $C$. For $u = (\cos\theta, \sin\theta) \in C$ we can easily calculate $h(A; u)$ for any zonotope $A$. Let first $A$ be a centered unit segment $\bar{e}$ with $e(\varphi) = (\cos\varphi, \sin\varphi)$. We have $h(\bar{e}; u) = h(\bar{e}(\varphi); u) = |\cos\varphi\cos\theta + \sin\varphi\sin\theta| = |\cos(\theta - \varphi)|$.

Similarly we can write down the support function of the zonotope $z = \sum_{i=1}^{k} \varepsilon_i * \bar{e}^{(i)}$, where $e^{(i)} = (\cos\varphi_i, \sin\varphi_i)$, $i = 1, \ldots, k$, $0 \le \varphi_1 < \ldots < \varphi_k < \pi$. We have $h(z, \theta) = \sum_{i=1}^{k} \varepsilon_i h(\bar{e}^{(i)}; \theta) = \sum_{i=1}^{k} \varepsilon_i |\cos(\theta - \varphi_i)|$.

The above approximation problem can be stated in terms of support functions as follows. Given some $\psi_i$, $i = 1, \ldots, m$, $0 \le \psi_1 < \ldots < \psi_m < \pi$, we want to approximate from above in the interval $[0, \pi]$ the function $w(\theta) = \sum_{i=1}^{m} \rho_i |\cos(\theta - \psi_i)|$ by means of a function of the class $z(\theta) = \sum_{i=1}^{k} \varepsilon_i |\cos(\theta - \varphi_i)|$, that is, we need to find the $\varepsilon$'s in $z$ so that $z \ge w$ and $z$ approximates $w$.
Concluding Remarks

From the theory of quasivector spaces it follows that we can compute with zonotopes as we do in vector spaces, provided that the zonotopes belong to the same space, that is, are presented by the same basic system. This makes it important to be able to approximately present any zonotope by a zonotope from a chosen basic class. We construct a simple algorithm for this purpose which can be effectively implemented in a software environment. Our approach is an alternative to the approach based on support functions, extensively used in the literature on convex bodies, and seems to be more direct and simpler. From the formulation of the approximation problem in terms of support functions we see that the latter formulation is in no way easier than the direct formulation in terms of zonotopes. We thus conclude that in some cases direct computation with zonotopes may be preferable to related computations based on support functions. Due to their simple presentation, zonotopes can be used instead of more traditional convex bodies like boxes, parallelepipeds, ellipsoids etc.
Abstract. Interval arithmetic can be applied to evaluate the ranges of polynomials. When the ranges of polynomials are evaluated by interval arithmetic, however, the interval widths of the computed ranges can increase extremely. Horner's method is widely known as an evaluation method which mitigates this problem. The purpose of this paper is to propose new methods which mitigate this problem more effectively than Horner's method. We show and compare the efficiency of each new method by the results of some numerical examples.



1 Introduction 

Interval arithmetic, e.g. [1], can be applied to evaluate the ranges of polynomials. When the ranges of polynomials are evaluated by interval arithmetic, however, the interval widths of the computed ranges can increase extremely. Horner's method is widely known as an evaluation method which mitigates this problem (that is, which supplies ranges whose interval widths are narrow). The purpose of this paper is to propose seven new methods which supply ranges whose interval widths are narrower than those of the ranges supplied by Horner's method. We show and compare the efficiency of each new method by the results of some numerical examples.

2 Conventional Methods 

2.1 Standard Method 

In this method, the range of a polynomial of degree $n$,

$a_n x^n + a_{n-1} x^{n-1} + \cdots + a_2 x^2 + a_1 x + a_0$, (1)

is calculated by interval arithmetic without any contrivances. The interval widths of the ranges increase extremely when this method is applied.

2.2 Horner's Method

In this method, formula (1) is transformed as follows:

$((\cdots((a_n x + a_{n-1})x + a_{n-2})x + \cdots + a_2)x + a_1)x + a_0$. (2)
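The two conventional methods can be compared with a minimal Python sketch. The helper names (`iadd`, `imul`, `ipow`, `standard`, `horner`) are hypothetical, intervals are plain `(lo, hi)` tuples, powers are formed by repeated interval multiplication, and no outward rounding is performed, so the sketch illustrates the overestimation effect rather than giving rigorously verified bounds. It is applied to the quartic of formula (23) on the domain $[0.5, 5]$.

```python
def iadd(a, b):
    """Interval addition."""
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):
    """Interval multiplication: min and max over the endpoint products."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))

def ipow(x, n):
    """n-th power of an interval by repeated multiplication."""
    result = (1.0, 1.0)
    for _ in range(n):
        result = imul(result, x)
    return result

def standard(coeffs, x):
    """Formula (1): sum of a_i * x^i; coeffs are ordered a_n, ..., a_0."""
    n = len(coeffs) - 1
    total = (0.0, 0.0)
    for i, a in enumerate(coeffs):
        total = iadd(total, imul((a, a), ipow(x, n - i)))
    return total

def horner(coeffs, x):
    """Formula (2): nested evaluation (...(a_n x + a_{n-1}) x + ...) x + a_0."""
    acc = (coeffs[0], coeffs[0])
    for a in coeffs[1:]:
        acc = iadd(imul(acc, x), (a, a))
    return acc

coeffs = [0.25, -2.0, 5.5, -6.0, 2.0]   # polynomial of formula (23)
domain = (0.5, 5.0)
print(standard(coeffs, domain))
print(horner(coeffs, domain))
```

On this input both enclosures are far wider than the strict range $[-0.25, 15.75]$ reported in Table 1, with Horner's form being the narrower of the two.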



I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 254-261, 2004. 
3 New Methods

In this section, we propose seven new methods. Here, let the polynomial of degree $n$ be $f(x)$ or $f_n(x)$ and let the domain be $[\underline{x}, \overline{x}]$.

3.1 New Method 1

Let $c$ be the center of the domain, namely $c = (\underline{x} + \overline{x})/2$. By introducing $c$, formula (1) is transformed as follows:

$b_n(x - c)^n + b_{n-1}(x - c)^{n-1} + \cdots + b_2(x - c)^2 + b_1(x - c) + b_0$. (3)

When formula (1) is transformed into formula (3), the following two benefits are expected.

(efficiency 1)

Generally, for a domain $x$ and its center $c$, the following formula is satisfied:

$\max(|\underline{x}|, |\overline{x}|) \ge |\underline{x} - c| = |\overline{x} - c|$. (4)

Therefore, the following formula is satisfied:

$w(x^n) \ge w((x - c)^n)$ $(n = 1, 2, \cdots)$. (5)

Note that $w(x^n)$ and $w((x - c)^n)$ are the interval widths of $x^n$ and $(x - c)^n$.

(efficiency 2)

By introducing $r$, the radius of the domain $x$, $x - c$ can be described as follows:

$x - c = [-r, r]$. (6)

Therefore, for an even number $m$ ($m = 2, 4, \cdots$, $m \le n$), the following formula is satisfied:

$(x - c)^m = [0, r^m]$. (7)

In this way, the negative part can be eliminated.

3.2 New Method 2 

In this method, by introducing $X = (x - c)^2 = [0, r^2]$, formula (3) is transformed as follows:

$f(x) = ((\cdots((b_p X + b_{p-2})X + b_{p-4})X + \cdots + b_4)X + b_2)X$
$\qquad + (x - c)((\cdots((b_q X + b_{q-2})X + b_{q-4})X + \cdots + b_3)X + b_1) + b_0$. (8)

Note that $p$ and $q$ are the maximum even number and the maximum odd number which do not exceed $n$. These $p$ and $q$ are described as follows:

$p = n - (n \bmod 2)$, (9)

$q = n - 1 + (n \bmod 2)$. (10)

By transforming formula (3) into formula (8), Horner's method can be applied while maintaining most of the two benefits of new method 1. This method also requires less time than new method 1 because Horner's method is applied.
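New methods 1 and 2 can be sketched in Python as follows. The paper does not say how the coefficients $b_i$ of formula (3) are obtained; the sketch assumes repeated synthetic division (a Taylor shift), and all function names are hypothetical. Intervals are `(lo, hi)` tuples and no outward rounding is applied.

```python
def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))

def taylor_shift(coeffs, c):
    """Coefficients b_n, ..., b_0 of formula (3), from a_n, ..., a_0,
    computed by repeated synthetic division by (x - c)."""
    b = list(coeffs)
    n = len(b) - 1
    for i in range(n):
        for j in range(1, n + 1 - i):
            b[j] += c * b[j - 1]
    return b

def method1(coeffs, lo, hi):
    """Formula (3) with (x-c)^m = [0, r^m] for even m, [-r^m, r^m] for odd m."""
    c, r = (lo + hi) / 2, (hi - lo) / 2
    bl = taylor_shift(coeffs, c)[::-1]        # bl[i] = b_i
    low = high = bl[0]
    for i in range(1, len(bl)):
        t = abs(bl[i]) * r ** i
        if i % 2 == 1:                        # odd power: symmetric interval
            low, high = low - t, high + t
        elif bl[i] >= 0:                      # even power, b_i >= 0: [0, t]
            high += t
        else:                                 # even power, b_i < 0: [-t, 0]
            low -= t
    return (low, high)

def method2(coeffs, lo, hi):
    """Formula (8): Horner's scheme in X = (x - c)^2 = [0, r^2]."""
    c, r = (lo + hi) / 2, (hi - lo) / 2
    bl = taylor_shift(coeffs, c)[::-1]
    n = len(bl) - 1
    X = (0.0, r * r)

    def chain(indices):
        acc = (0.0, 0.0)
        for i in indices:
            acc = iadd(imul(acc, X), (bl[i], bl[i]))
        return acc

    p = n - n % 2                             # formula (9)
    q = n - 1 + n % 2                         # formula (10)
    even = imul(chain(range(p, 0, -2)), X)
    odd = imul(chain(range(q, 0, -2)), (-r, r))
    return iadd(iadd(even, odd), (bl[0], bl[0]))

coeffs = [0.25, -2.0, 5.5, -6.0, 2.0]         # polynomial of formula (23)
print(method1(coeffs, 0.5, 5.0))
print(method2(coeffs, 0.5, 5.0))
```

For the quartic of formula (23) the two enclosures agree, up to rounding of the printed digits, with the rows for new methods 1 and 2 in Table 1.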
3.3 New Method 3

Introducing $t \in [-r, r]$, let formula (3) be as follows:

$b_n t^n + b_{n-1} t^{n-1} + \cdots + b_2 t^2 + b_1 t + b_0$. (11)

In this method, formula (11) is transformed as follows:

$(\cdots((\alpha)t^2 + b_i t + b_{i-1})t^2 + \cdots + b_2)t^2 + b_1 t + b_0$. (12)

Note that

$\alpha = \begin{cases} b_n t^2 + b_{n-1} t + b_{n-2} & (n \bmod 2 = 0) \\ b_n t + b_{n-1} & (n \bmod 2 = 1) \end{cases}$, $\quad i = \begin{cases} n - 3 & (n \bmod 2 = 0) \\ n - 2 & (n \bmod 2 = 1) \end{cases}$. (13)

The range of formula (12) is obtained by successively calculating the maximum value and the minimum value of each parenthesised formula (a quadratic function with interval-bounded leading coefficient). The concrete steps of this method are as follows:

Step 1. Calculate $p_0$ and $q_0$ which satisfy formula (14).

$p_0 = \min_{-r \le t \le r} \alpha$, $\quad q_0 = \max_{-r \le t \le r} \alpha$. (14)

Step 2. Let $m$ be as follows:

$m = \frac{n + (n \bmod 2)}{2} - 1$. (15)

Then calculate the following $p_1, p_2, \cdots, p_m$ and $q_1, q_2, \cdots, q_m$:

$p_1 = \min_{-r \le t \le r}([p_0, q_0]t^2 + b_i t + b_{i-1})$
$\quad = \min(\min_{-r \le t \le r}(p_0 t^2 + b_i t + b_{i-1}), \min_{-r \le t \le r}(q_0 t^2 + b_i t + b_{i-1}))$,

$q_1 = \max_{-r \le t \le r}([p_0, q_0]t^2 + b_i t + b_{i-1})$
$\quad = \max(\max_{-r \le t \le r}(p_0 t^2 + b_i t + b_{i-1}), \max_{-r \le t \le r}(q_0 t^2 + b_i t + b_{i-1}))$,

$\cdots$

$p_m = \min(\min_{-r \le t \le r}(p_{m-1} t^2 + b_1 t + b_0), \min_{-r \le t \le r}(q_{m-1} t^2 + b_1 t + b_0))$,

$q_m = \max(\max_{-r \le t \le r}(p_{m-1} t^2 + b_1 t + b_0), \max_{-r \le t \le r}(q_{m-1} t^2 + b_1 t + b_0))$.

Step 3. The range is $[p_m, q_m]$.
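Steps 1 to 3 can be sketched in Python as follows. The names `quad_range` and `method3` are hypothetical, and the coefficients $b_i$ are again assumed to come from repeated synthetic division about the center $c$. For fixed $t$ the expression $a t^2 + b t + c$ is linear in $a$, so taking the minimum and maximum over the two endpoint values $a = p$ and $a = q$ is enough, which is what formula (12) onward relies on. No outward rounding is applied.

```python
def taylor_shift(coeffs, c):
    """Coefficients b_n, ..., b_0 of the polynomial in powers of (x - c)."""
    b = list(coeffs)
    n = len(b) - 1
    for i in range(n):
        for j in range(1, n + 1 - i):
            b[j] += c * b[j - 1]
    return b

def quad_range(a_lo, a_hi, b, c, r):
    """Min and max over t in [-r, r] of a*t^2 + b*t + c for a in {a_lo, a_hi}."""
    values = []
    for a in (a_lo, a_hi):
        ts = [-r, r]
        if a != 0.0:
            tv = -b / (2.0 * a)            # stationary point of the quadratic
            if -r <= tv <= r:
                ts.append(tv)
        values += [a * t * t + b * t + c for t in ts]
    return min(values), max(values)

def method3(coeffs, lo, hi):
    c, r = (lo + hi) / 2, (hi - lo) / 2
    bl = taylor_shift(coeffs, c)[::-1]     # bl[i] = b_i
    n = len(bl) - 1
    if n == 0:
        return (bl[0], bl[0])
    # Step 1: range [p0, q0] of alpha, formulas (13) and (14).
    if n % 2 == 0:
        p, q = quad_range(bl[n], bl[n], bl[n - 1], bl[n - 2], r)
        i = n - 3
    else:
        p, q = quad_range(0.0, 0.0, bl[n], bl[n - 1], r)
        i = n - 2
    # Step 2: peel off one quadratic layer [p, q]*t^2 + b_i*t + b_{i-1} at a time.
    while i >= 1:
        p, q = quad_range(p, q, bl[i], bl[i - 1], r)
        i -= 2
    # Step 3: the enclosure of the range.
    return (p, q)

print(method3([0.25, -2.0, 5.5, -6.0, 2.0], 0.5, 5.0))
```

Applied to the quartic of formula (23), the sketch returns an enclosure that matches the row for new method 3 in Table 1 to the printed precision.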

3.4 New Method 4 

In this method, formula (11) is transformed as follows:

$(\cdots((\alpha)t^3 + b_i t^2 + b_{i-1} t + b_{i-2})t^3 + \cdots + b_3)t^3 + b_2 t^2 + b_1 t + b_0$. (16)

Note that

$\alpha = \begin{cases} b_n t^3 + b_{n-1} t^2 + b_{n-2} t + b_{n-3} & (n \bmod 3 = 0) \\ b_n t^2 + b_{n-1} t + b_{n-2} & (n \bmod 3 = 2) \\ b_n t + b_{n-1} & (n \bmod 3 = 1) \end{cases}$, $\quad i = \begin{cases} n - 4 & (n \bmod 3 = 0) \\ n - 3 & (n \bmod 3 = 2) \\ n - 2 & (n \bmod 3 = 1) \end{cases}$.

The range of formula (16) is obtained by successively calculating the maximum value and the minimum value of each parenthesised formula (a cubic function with interval-bounded leading coefficient). The concrete steps of this method are mostly similar to those of new method 3.

3.5 New Method 5 

Let the quotient and the remainder when $f_n(x)$ is divided by its derivative $f_n'(x)$ (degree: $n - 1$) be $g(x)$ (degree: 1) and $f_{n-2}(x)$ (degree: $n - 2$). With these $g(x)$ and $f_{n-2}(x)$, the following formula is satisfied:

$f_n(x) = f_n'(x)g(x) + f_{n-2}(x)$. (17)

Here, let $x'$ ($x' \in [\underline{x}, \overline{x}]$) be a value of $x$ which satisfies $f_n'(x) = 0$. Substituting $x'$ into formula (17) gives the following formula, because $f_n'(x')g(x') = 0$:

$f_n(x') = f_{n-2}(x')$. (18)

By this formula, the following formula is satisfied:

$f_n(x') \in [\min_{\underline{x} \le x \le \overline{x}} f_{n-2}(x), \max_{\underline{x} \le x \le \overline{x}} f_{n-2}(x)]$. (19)

Therefore, the range of $f_n(x)$ is still enclosed if the upper bound and the lower bound of the range of $f_{n-2}(x)$ are nominated instead of the extreme values of $f_n(x)$ in $x \in [\underline{x}, \overline{x}]$. Accordingly, we have to calculate the upper bound and the lower bound of the range of $f_{n-2}(x)$. These bounds can be calculated by repeating the above operation and lowering the degree of $f_{n-2}(x)$ until the extreme values can be calculated easily. The concrete steps of this method are as follows:

Step 1. Calculate $f_n(\underline{x})$ and $f_n(\overline{x})$, and nominate them as candidates for the upper bound and the lower bound of the range of $f_n(x)$.

Step 2. Calculate the remainder $f_{n-2}(x)$ when $f_n(x)$ is divided by its derivative $f_n'(x)$. Then calculate $f_{n-2}(\underline{x})$ and $f_{n-2}(\overline{x})$, and nominate them as candidates for the upper bound and the lower bound of the range of $f_n(x)$.

Step 3. Repeat Step 2 until $f_{n-2}(x)$ becomes a linear function or a quadratic function.

Step 4. When $f_{n-2}(x)$ becomes a linear function or a quadratic function, calculate the extreme values of $f_{n-2}(x)$ in $x \in [\underline{x}, \overline{x}]$. Also calculate $f_{n-2}(\underline{x})$ and $f_{n-2}(\overline{x})$, and nominate all of them as candidates for the upper bound and the lower bound of the range of $f_n(x)$.

Step 5. Among the obtained candidates, the maximum value and the minimum value are the upper bound and the lower bound of the range of $f_n(x)$.
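Steps 1 to 5 can be sketched in Python with plain floating-point polynomial division. The function names are hypothetical, coefficients are ordered $a_n, \ldots, a_0$, and no interval arithmetic or outward rounding is used, so the result is an illustration of the procedure rather than a rigorously verified enclosure.

```python
def evaluate(coeffs, x):
    """Horner evaluation of a polynomial with coefficients a_n, ..., a_0."""
    acc = 0.0
    for a in coeffs:
        acc = acc * x + a
    return acc

def derivative(coeffs):
    n = len(coeffs) - 1
    return [a * (n - i) for i, a in enumerate(coeffs[:-1])]

def remainder(num, den):
    """Remainder of the polynomial division num / den."""
    rem = list(num)
    while len(rem) >= len(den):
        q = rem[0] / den[0]
        for i, d in enumerate(den):
            rem[i] -= q * d
        rem.pop(0)                     # the leading term has been cancelled
    return rem

def strip_leading_zeros(coeffs):
    while len(coeffs) > 1 and coeffs[0] == 0.0:
        coeffs = coeffs[1:]
    return coeffs

def method5(coeffs, lo, hi):
    g = strip_leading_zeros(list(coeffs))
    # Step 1: endpoint values of f_n.
    candidates = [evaluate(g, lo), evaluate(g, hi)]
    # Steps 2-3: replace g by the remainder of g / g' until degree <= 2.
    while len(g) - 1 > 2:
        g = strip_leading_zeros(remainder(g, derivative(g)))
        candidates += [evaluate(g, lo), evaluate(g, hi)]
    # Step 4: extreme value of the final quadratic, if it lies in the domain.
    if len(g) - 1 == 2:
        xv = -g[1] / (2.0 * g[0])
        if lo <= xv <= hi:
            candidates.append(evaluate(g, xv))
    # Step 5: the bounds are the smallest and largest candidates.
    return (min(candidates), max(candidates))

print(method5([0.25, -2.0, 5.5, -6.0, 2.0], 0.5, 5.0))
```

For the quartic of formula (23) the remainder is $-0.25x^2 + x - 1$, and the resulting enclosure coincides with the row for new method 5 in Table 1.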
3.6 New Method 6

Using $f_n(x)$ and its derivative $f_n'(x)$, let $f_{n-1}(x)$ (degree: $n - 1$) be as follows:

$f_{n-1}(x) = f_n(x) - \frac{x}{n} f_n'(x)$. (20)

Here, let $x'$ ($x' \in [\underline{x}, \overline{x}]$) be a value of $x$ which satisfies $f_n'(x) = 0$. Substituting $x'$ into formula (20) gives the following formula, because $\frac{x'}{n} f_n'(x') = 0$:

$f_{n-1}(x') = f_n(x')$. (21)

By this formula, the following formula is satisfied:

$f_n(x') \in [\min_{\underline{x} \le x \le \overline{x}} f_{n-1}(x), \max_{\underline{x} \le x \le \overline{x}} f_{n-1}(x)]$. (22)

Therefore, the range of $f_n(x)$ is still enclosed if the upper bound and the lower bound of the range of $f_{n-1}(x)$ are nominated instead of the extreme values of $f_n(x)$ in $x \in [\underline{x}, \overline{x}]$. Accordingly, we have to calculate the upper bound and the lower bound of the range of $f_{n-1}(x)$. These bounds can be calculated by repeating the above operation and lowering the degree of $f_{n-1}(x)$ until the extreme values can be calculated easily. The concrete steps of this method are mostly similar to those of new method 5.
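Formula (20) has a simple coefficient form: if $f_n(x) = \sum_k a_k x^k$, then $\frac{x}{n} f_n'(x) = \sum_k \frac{k}{n} a_k x^k$, so the coefficient of $x^k$ in $f_{n-1}$ is $(1 - k/n)a_k$ and the leading term cancels. The following Python sketch uses this observation; the function names are hypothetical, and the step structure is an assumption modelled on new method 5, since the text only says the steps are similar. No outward rounding is applied.

```python
def evaluate(coeffs, x):
    """Horner evaluation of a polynomial with coefficients a_n, ..., a_0."""
    acc = 0.0
    for a in coeffs:
        acc = acc * x + a
    return acc

def reduce_degree(coeffs):
    """f_{n-1} = f_n - (x/n) f_n': scale the coefficient of x^k by (1 - k/n)
    and drop the leading term, whose factor is zero."""
    n = len(coeffs) - 1
    return [a * (1.0 - (n - i) / n) for i, a in enumerate(coeffs)][1:]

def method6(coeffs, lo, hi):
    g = list(coeffs)
    candidates = [evaluate(g, lo), evaluate(g, hi)]
    # Lower the degree by one per pass until a quadratic (or lower) remains.
    while len(g) - 1 > 2:
        g = reduce_degree(g)
        candidates += [evaluate(g, lo), evaluate(g, hi)]
    # Extreme value of the final quadratic, if it lies in the domain.
    if len(g) == 3 and g[0] != 0.0:
        xv = -g[1] / (2.0 * g[0])
        if lo <= xv <= hi:
            candidates.append(evaluate(g, xv))
    return (min(candidates), max(candidates))

print(reduce_degree([0.25, -2.0, 5.5, -6.0, 2.0]))
print(method6([0.25, -2.0, 5.5, -6.0, 2.0], 0.5, 5.0))
```

For the quartic of formula (23), the first reduction gives $-0.5x^3 + 2.75x^2 - 4.5x + 2$, whose value at $x = 5$ is $-14.25$; this is the lower bound reported for new method 6 in Table 1.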

3.7 New Method 7 

In this method, sufficiently narrow intervals which include the points $x$ satisfying $f'(x) = 0$ are calculated by the bisection method. The concrete steps of this method are as follows:

Step 1. Calculate $f(\underline{x})$ and $f(\overline{x})$, and nominate them as candidates for the upper bound and the lower bound of the range of $f(x)$.

Step 2. Let $sx + t$ be the derivative of order $n - 1$ of $f(x)$.

(i) In the case that $-t/s \in [\underline{x}, \overline{x}]$

Step 2-1. Calculate $f(-t/s)$, and nominate it as a candidate for the upper bound and the lower bound of the range of $f(x)$.

Step 2-2. Let the intervals $x \in [\underline{x}, -t/s]$ and $x \in [-t/s, \overline{x}]$ be the initial intervals of the bisection method for the derivative of order $n - 2$. Apply the bisection method a suitable number of times and calculate the sufficiently narrow intervals, each of which includes exactly one solution of $f^{(n-2)}(x) = 0$.

Step 2-3. Calculate the upper bound and the lower bound of the range of $f(x)$ obtained by applying the standard method on these narrow intervals, and nominate them as candidates for the upper bound and the lower bound of the range of $f(x)$.

Step 2-4. Let the three intervals delimited by $\underline{x}$, the narrow intervals found in Step 2-2, and $\overline{x}$ be the initial intervals of the bisection method for the derivative of order $n - 3$. Apply the bisection method a suitable number of times and calculate the sufficiently narrow intervals, each of which includes exactly one solution of $f^{(n-3)}(x) = 0$. Then calculate the upper bound and the lower bound of the range of $f(x)$ obtained by applying the standard method on these intervals, and nominate them as candidates for the upper bound and the lower bound of the range of $f(x)$.

Step 2-5. Repeat the operation of Step 2-4 for the derivatives of order $n - 4, n - 5, \cdots, 1$ and calculate the candidates for the upper bound and the lower bound of the range of $f(x)$.

(ii) In the case that $-t/s \notin [\underline{x}, \overline{x}]$

Step 2-1. Let the interval $x \in [\underline{x}, \overline{x}]$ be the initial interval of the bisection method for the derivative of order $n - 2$. Apply the bisection method a suitable number of times and calculate the sufficiently narrow interval which includes the only solution of $f^{(n-2)}(x) = 0$.

Step 2-2. Calculate the upper bound and the lower bound of the range of $f(x)$ obtained by applying the standard method on this narrow interval, and nominate them as candidates for the upper bound and the lower bound of the range of $f(x)$.

Step 2-3. Let the two intervals delimited by $\underline{x}$, the narrow interval found in Step 2-1, and $\overline{x}$ be the initial intervals of the bisection method for the derivative of order $n - 3$. Apply the bisection method a suitable number of times and calculate the sufficiently narrow intervals, each of which includes exactly one solution of $f^{(n-3)}(x) = 0$. Then calculate the upper bound and the lower bound of the range of $f(x)$ obtained by applying the standard method on these intervals, and nominate them as candidates for the upper bound and the lower bound of the range of $f(x)$.

Step 2-4. Repeat the operation of Step 2-3 for the derivatives of order $n - 4, n - 5, \cdots, 1$ and calculate the candidates for the upper bound and the lower bound of the range of $f(x)$.

Step 3. Among the obtained candidates, the maximum value and the minimum value are the upper bound and the lower bound of the range of $f(x)$.



4 Evaluation of the New Method 

In this section, we show and compare the efficiency of each new method by the results of some numerical examples. The computing environment is CPU: Pentium II 350 MHz, memory: 128 MB, OS: FreeBSD 2.2.8 and compiler: g++ 2.7.2.1.

4.1 Numerical Examples 

First, for formula (23), whose strict range can be calculated easily, the range, the interval width and the calculation time obtained when each method is applied, together with the strict range and its interval width, are shown in Table 1.

$f(x) = 0.25x^4 - 2x^3 + 5.5x^2 - 6x + 2$, $x \in [0.5, 5]$ (23)

Note that the number of repetitions of the bisection method in new method 7 is 10.

Table 1. Operating results of the formula (23)

methods             ranges                       interval widths    times (μs)
The strict range    [-0.250000, 15.750000]       16.000000          -
Standard method     [-276.60938, 292.50000]      569.10938          5.023438
Horner's method     [-124.87500, 100.12500]      225.00000          2.742188
New method 1        [-9.483398, 17.226563]       26.709961          6.835938
New method 2        [-8.006836, 15.750000]       23.756836          5.179688
New method 3        [-2.047852, 17.226563]       19.274415          6.976563
New method 4        [-12.673828, 15.750000]      28.423828          10.468750
New method 5        [-2.250000, 15.750000]       18.000000          4.726562
New method 6        [-14.250000, 15.750000]      30.000000          8.265625
New method 7        [-0.391751, 15.750000]       16.141751          100.195313

Next, for formula (24), a polynomial of degree 10, the range, the interval width and the calculation time obtained when each method is applied are shown in Table 2.

$f(x) = 0.1x^{10} - 5x^9 + 108.75x^8 - 1350x^7 + 10545.5x^6 - 53865x^5$
$\qquad + 180920x^4 - 390900x^3 + 513288x^2 - 362880x + 1.7$, $x \in [0.5, 9.5]$ (24)

Note that the number of repetitions of the bisection method in new method 7 is 30.

Table 2. Operating results of the formula (24)

methods             ranges                          interval widths    times (μs)
Standard method     [-17085227017, 17085042617]     34170269634        10.42969
Horner's method     [-4446754933, 4641737430]       9088492363         6.304688
New method 1        [-816359.5646, 622430.0751]     1438789.64         17.48438
New method 2        [-432706.2287, 197860.9734]     630567.2021        10.82031
New method 3        [-432706.2287, 197860.9734]     630567.2021        21.86719
New method 4        [-773212.5179, 581417.4179]     1354629.936        18.79688
New method 5        [-122486.5436, -85667.06016]    36819.4834         17.10156
New method 6        [-812134.095, 643192.3238]      1455326.419        26.81250
New method 7        [-104142.5277, -92199.93955]    11942.58816        4522.602

4.2 On the Result of the Numerical Example 

The numerical examples described in Section 4.1 confirm that the seven new methods supply ranges whose interval widths are narrower than that of the range supplied by Horner's method, although these new methods need longer calculation times than Horner's method. The new method 2



On Range Evaluation of Polynomials by Applying Interval Arithmetic



is superior to new method 1 in both the interval width and the calculation time. New method 3 supplies ranges whose interval widths are no wider than those of new method 2, but it needs a longer calculation time than new method 2. Whether new method 4 is superior to the other new methods depends on the numerical example. New method 6 has the same character. Generally, new method 5 is superior to new method 6 in both the interval width and the calculation time. New method 7 supplies the range whose interval width is the narrowest of all methods, but this method needs an extremely long calculation time.

5 Conclusion 

In this paper, we proposed seven new methods which supply ranges whose interval widths are narrower than those of the ranges supplied by Horner's method. We also showed and compared the efficiency of each new method by the results of some numerical examples.
Abstract. Many mechanical systems, involving interval uncertainties and modelled by the finite element method, can be described by parameter dependent systems of linear interval equations and response quantities depending on the system solution. A newly developed hybrid interval approach for solving parametric interval linear systems is applied to a 2D engineering model with interval parameters and the results are compared to other interval methods. A new technique providing very sharp bounds for the response quantities, such as element strains and stresses, is developed. The sources of overestimation when dealing with interval computations are demonstrated on a numerical plane strain example.



1 Introduction 

All engineering design problems involve imprecision, approximation, or uncertainty to varying degrees [1,3]. In particular, mathematical models in environmental geomechanics cover a broad class of problems involving deformation of geomaterials. Since soil, rock and clay materials are natural ones, there is uncertainty in the material properties [6]. When the information about an uncertain parameter in the form of a preference or probability function is not available or not sufficient, interval analysis can be used most conveniently [3]. Many mechanical systems, modelled by the finite element method (FEM), can be described by parameter dependent systems of linear equations. If some of the parameters are uncertain but bounded, the problem can be transformed into a parametric interval linear system which should be solved appropriately to bound the mechanical system response. This technique is usually called the Interval Finite Element Method. The efforts for developing suitable interval FE methods started in the mid-nineties and have attracted considerable attention [1]. Contrary to structural mechanics, where significant effort is devoted to solving FE models involving interval uncertainties, this paper makes a first attempt to consider a 2D model with interval parameters related to soil physics and soil mechanics, applied to engineering. Since safety is an issue in environmental geomechanics, the goal is to describe the response of the system under a worst case scenario of uncertain parameters varying within prescribed bounds.

* This work was supported by the NATO GLG 979541 and the Bulgarian National Science Fund under grants No. 1-903/99, MM-1301/03.

As in the FE analysis of most engineering systems [1, 3], the problem considered here can be stated as follows: Given a system of linear equations, describing the response of a discretized structural mechanical model, in the form

$K(p) \cdot U = F(p)$ (1)

and a response vector

$R = R(U)$, (2)

where $K(p)$ is the global stiffness matrix of the system, $F(p)$ is the global load vector (which might depend on uncertain parameters), $U$ is the nodal displacement vector of the total discrete model, and $p$ is the vector of uncertain input parameters which are described as intervals $p = \{p_i\} = \{[p_i^-, p_i^+]\}$, $i = 1, 2, \ldots, m$, where $p_i^-$ denotes the lower bound and $p_i^+$ denotes the upper bound for the corresponding parameter value. $R$ is the vector of response quantities which can be expressed in terms of nodal displacements, $U$, of the model. In this work we consider the following response quantities: element strains, $\varepsilon_e$, and element stresses, $\sigma_e$,

$\varepsilon_e = B \cdot U_e$ and (3)

$\sigma_e = D(p_e) \cdot B \cdot U_e$, (4)

where $p_e$ is the element parameter vector which is, generally, a subvector of the global parameter vector $p$, $U_e$ is the subvector of element nodal displacements, and the matrices $B$, $D$ are defined in Sec. 3. The problem is to find the range $[R^-, R^+]$ of each response quantity due to the uncertainty present in the input parameters. A key issue is how to avoid the overestimation effect caused by the dependency problem in interval analysis.

The solution of parametric linear interval equations (1) and subsequent estimation of the response parameters (3), (4) is discussed in this work. We extend the hybrid approach for solving parametric interval linear systems [2] to the subsequent bounding of the response quantities, namely element strains and stresses.

2 Interval Methods

2.1 Solving Parametric Interval Linear Systems

Linear algebraic systems (1) usually involve complicated dependencies between the parameters involved in the stiffness matrix $K(p)$ and the load vector $F(p)$. In this paper we assume that these dependencies are affine-linear, that is


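$k_{ij}(p) = \lambda_{ij,0} + \sum_{\nu=1}^{m} \lambda_{ij,\nu}\, p_\nu$, $\qquad f_i(p) = \beta_{i,0} + \sum_{\nu=1}^{m} \beta_{i,\nu}\, p_\nu$,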





Evgenija Popova et al.



where $\lambda_{ij}, \beta_i \in \mathbb{R}^{m+1}$ $(i, j = 1, \ldots, n)$ are numerical vectors, with $n$ being the dimension of (1). When the $m$ parameters $p_\nu$ take arbitrary values from given intervals $[p_\nu]$, the solution of (1) is a set

$\Sigma^p = \Sigma(K(p), F(p), [p]) := \{U \in \mathbb{R}^n \mid K(p) \cdot U = F(p) \text{ for some } p \in [p]\}$

called the parametric solution set (PSS). The PSS is a subset of, and has much smaller volume than, the corresponding non-parametric solution set $\Sigma([K], [F]) := \{U \in \mathbb{R}^n \mid \exists K \in [K] = K([p]), \exists F \in [F] = F([p]) : K \cdot U = F\}$. The simplest example of dependencies is when the matrix is symmetric. Since the solution sets have a complicated structure which is difficult to find, we look for the interval hull $\Box\Sigma := [\inf\Sigma, \sup\Sigma]$, whenever $\Sigma$ is a nonempty bounded subset of $\mathbb{R}^n$, or for an interval enclosure $[U]$ of $\Box\Sigma$.

There are many works devoted to the interval treatment of uncertain mechanical systems [1]. A main reason for some conservative results that have been obtained is that the interval linear systems (1), involving more dependencies than in a symmetric matrix, are solved by methods for nonparametric interval systems (or for symmetric matrices). Although designed quite long ago, the only available iterative method [5] for guaranteed enclosure of the PSS seems to be unknown to application scientists, or at least not applied. The parametric Rump's method is a fixed-point method applying residual iteration and accounting for all the dependencies between parameters. As will be shown in Sec. 3, this parametric method produces very sharp bounds for narrow intervals but tends to overestimate the PSS hull as the interval tolerances and the number of parameters increase. To overcome this deficiency, we designed a new hybrid approach for sharp PSS enclosures that combines parametric residual iteration and exact bounds based on monotonicity properties [2].

Finding very sharp bounds (or exact in exact arithmetic) for the PSS is based 
on the monotonicity properties of the analytic solution U{p) = K{p)~^ ■ F{p). 
For large real-life problems, the computer aided proof of the monotonicity of 
U{p) with respect to each parameter is based on taking partial derivatives on 
(1) [4]. This leads to a parametric interval linear system 



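$K(p)\,\frac{\partial U}{\partial p_\nu} = \frac{\partial F(p)}{\partial p_\nu} - \frac{\partial K(p)}{\partial p_\nu}\,[U]$   (5)

where $[U] \supseteq \Sigma^p$ is a PSS enclosure. Let for fixed $i$, $1 \le i \le n$,

$L_+^{u_i} = \{\nu \mid \mathrm{Sign}([\partial U_i / \partial p_\nu]) = 1\}, \qquad L_-^{u_i} = \{\nu \mid \mathrm{Sign}([\partial U_i / \partial p_\nu]) = -1\}$

and $L_+^{u_i} \cup L_-^{u_i} = \{1, \dots, m\}$. Define numerical vectors $p^{u_i}$, $p^{-u_i}$ componentwise

$p_j^{u_i} := p_j^-$ if $j \in L_+^{u_i}$, $\quad p_j^{u_i} := p_j^+$ if $j \in L_-^{u_i}$,
and
$p_j^{-u_i} := p_j^+$ if $j \in L_+^{u_i}$, $\quad p_j^{-u_i} := p_j^-$ if $j \in L_-^{u_i}$, $\qquad j = 1, \dots, m$.   (6)

Then the exact bounds of the PSS, $U_i^- = \inf\{\Sigma^p\}_i$ and $U_i^+ = \sup\{\Sigma^p\}_i$,

$[U_i^-, U_i^+] = [\{K(p^{u_i})^{-1} F(p^{u_i})\}_i,\ \{K(p^{-u_i})^{-1} F(p^{-u_i})\}_i], \qquad i = 1, \dots, n,$

can be obtained by solving at most 2n point linear systems. This approach is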
extended in the next subsection for sharp bounding of the response quantities. 
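
The monotonicity-based bounding described above can be sketched in Python. This is only an illustration under stated assumptions: the two-spring system, the helper names `stiffness` and `monotone_bounds`, and the tolerance are hypothetical, the load vector is taken as parameter-independent, and the signs of the derivatives are merely sampled at the corners of the parameter box in ordinary floating-point arithmetic. The approach in the text instead proves the signs rigorously by enclosing the solution of (5) in validated interval arithmetic.

```python
import itertools
import numpy as np

def stiffness(K0, Ks, p):
    """K(p) = K0 + sum_v p[v] * Ks[v], i.e. affine-linear in the parameters."""
    return K0 + sum(pv * Kv for pv, Kv in zip(p, Ks))

def monotone_bounds(K0, Ks, F, lo, hi, tol=1e-12):
    """Componentwise bounds of U(p) over the box [lo, hi], assuming monotonicity.

    Assumption: signs of dU_i/dp_v are only sampled at the box corners
    (a heuristic check, not a proof of monotonicity).
    """
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    n, m = len(F), len(Ks)
    pos = np.zeros((n, m), dtype=bool)
    neg = np.zeros((n, m), dtype=bool)
    for corner in itertools.product(*zip(lo, hi)):
        K = stiffness(K0, Ks, corner)
        U = np.linalg.solve(K, F)
        # Differentiating K(p) U = F gives K(p) dU/dp_v = -K_v U (F is constant here).
        dU = np.column_stack([np.linalg.solve(K, -Kv @ U) for Kv in Ks])
        pos |= dU > tol
        neg |= dU < -tol
    if np.any(pos & neg):
        raise ValueError("derivative signs change: monotonicity not confirmed")
    lower, upper = np.empty(n), np.empty(n)
    for i in range(n):
        p_low = np.where(pos[i], lo, hi)   # endpoint vector giving the infimum
        p_up = np.where(pos[i], hi, lo)    # endpoint vector giving the supremum
        lower[i] = np.linalg.solve(stiffness(K0, Ks, p_low), F)[i]
        upper[i] = np.linalg.solve(stiffness(K0, Ks, p_up), F)[i]
    return lower, upper

# Hypothetical two-spring chain: U1 = 1/p1, U2 = 1/p1 + 1/p2.
K0 = np.zeros((2, 2))
Ks = [np.array([[1.0, 0.0], [0.0, 0.0]]), np.array([[1.0, -1.0], [-1.0, 1.0]])]
F = np.array([0.0, 1.0])
lower, upper = monotone_bounds(K0, Ks, F, lo=[1.0, 2.0], hi=[2.0, 4.0])
print(lower, upper)
```

For each solution component at most two point systems are solved, which matches the count of at most 2n point linear systems stated above.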



2.2 Bounding the Response Quantities 



A naive interval approach to finding the range of the response quantities $R(U)$,
such as element strains (3) and element stresses (4), under the uncertainties in
input parameters is to substitute the exact hull (or its enclosure) $[U]$, found by some
method, into the expressions (3), (4) and to perform all interval operations. Since
the nonzero components of $[U_e]$ depend on the uncertain parameters, the range
computation of $R(U_e)$ implicitly involves the dependency problem, implying a
huge overestimation of the exact range, as will be shown in Sec. 3. To get sharp
range estimations, we use the monotonicity properties of the responses. Taking
partial derivatives on (3), (4) we obtain



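$\frac{\partial \varepsilon_e}{\partial p_\nu} = B \left[\frac{\partial U_e}{\partial p_\nu}\right]$   (since $B$ doesn't depend on $p$)   (7)

$\frac{\partial \sigma_e}{\partial p_\nu} = \frac{\partial D(p_e)}{\partial p_\nu}\,[\varepsilon_e(p)] + D(p_e) \left[\frac{\partial \varepsilon_e}{\partial p_\nu}\right], \qquad \nu = 1, \dots, m$   (8)

where $[\partial U_e / \partial p_\nu]$ is a subvector of the solution of (5), $[\varepsilon_e(p)]$ is the solution of (3),
and $[\partial \varepsilon_e / \partial p_\nu]$ is taken from (7). Let for fixed $i \in \{1, \dots, n_1\}$, $j \in \{1, \dots, n_2\}$, where
$n_1$, $n_2$ are the dimensions of $\varepsilon_e$, $\sigma_e$ respectively,

$L_-^{\varepsilon_i} = \{\nu \mid \mathrm{Sign}([\partial \{\varepsilon_e\}_i / \partial p_\nu]) = -1\}, \qquad L_+^{\varepsilon_i} = \{\nu \mid \mathrm{Sign}([\partial \{\varepsilon_e\}_i / \partial p_\nu]) = 1\},$

$L_-^{\sigma_j} = \{\nu \mid \mathrm{Sign}([\partial \{\sigma_e\}_j / \partial p_\nu]) = -1\}, \qquad L_+^{\sigma_j} = \{\nu \mid \mathrm{Sign}([\partial \{\sigma_e\}_j / \partial p_\nu]) = 1\}.$

Note that the sets $L_-^{u_i}$, $L_+^{u_i}$, $L_-^{\varepsilon_i}$, $L_+^{\varepsilon_i}$, $L_-^{\sigma_j}$, $L_+^{\sigma_j}$ differ from one another and for
every fixed $i$, $j$. Define the numerical vectors $p^{\varepsilon_i}$, $p^{-\varepsilon_i}$ for every $1 \le i \le n_1$, and
$p^{\sigma_j}$, $p^{-\sigma_j}$ for every $1 \le j \le n_2$, analogously to (6). Then the exact bounds of $\varepsilon_e$
are obtained as

$[\{\varepsilon_e\}_i^-, \{\varepsilon_e\}_i^+] = [\{B\}_i \cdot U_e(p^{\varepsilon_i}),\ \{B\}_i \cdot U_e(p^{-\varepsilon_i})], \qquad i = 1, \dots, n_1 .$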

Note that a rigorous very sharp enclosure can be obtained if instead of $U_e(p^{\varepsilon_i})$,
$U_e(p^{-\varepsilon_i})$ we take the corresponding components of the rigorous interval
enclosures $[U(p^{\varepsilon_i})]$, $[U(p^{-\varepsilon_i})]$ of the solution to the corresponding point systems (1).
For fixed $j \in \{1, \dots, n_2\}$

$\inf\{\sigma_e\}_j = \{D(p^{\sigma_j}) \cdot B \cdot U_e(p^{\sigma_j})\}_j = \{D(p^{\sigma_j}) \cdot \varepsilon_e(p^{\sigma_j})\}_j$
$\sup\{\sigma_e\}_j = \{D(p^{-\sigma_j}) \cdot B \cdot U_e(p^{-\sigma_j})\}_j = \{D(p^{-\sigma_j}) \cdot \varepsilon_e(p^{-\sigma_j})\}_j .$

However, the above computation of the $\sigma_e$ range is rigorous only in exact
arithmetic. A rigorous and very sharp range enclosure of $\sigma_e$ in floating-point
arithmetic should use rigorous interval enclosures $[U(p^{\sigma_j})]$, $[U(p^{-\sigma_j})]$ of the solution
to the corresponding point linear systems (1) and computations with validated
interval arithmetic.
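
The dependency problem behind the naive evaluation can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the two-spring displacements $U_1 = 1/p_1$, $U_2 = 1/p_1 + 1/p_2$, the one-row matrix `B`, and the helper `interval_matvec`. It uses plain floating-point arithmetic without outward rounding, so it is not a validated computation.

```python
import numpy as np

def interval_matvec(B, lo, hi):
    """Enclose {B @ u : lo <= u <= hi}, treating the entries of u as independent."""
    B = np.asarray(B, float)
    B_pos, B_neg = np.maximum(B, 0.0), np.minimum(B, 0.0)
    return B_pos @ lo + B_neg @ hi, B_pos @ hi + B_neg @ lo

# Exact hull of the displacements for p1 in [1, 2], p2 in [2, 4] (hypothetical model).
U_lo = np.array([0.5, 0.75])
U_hi = np.array([1.0, 1.5])

# "Strain" U2 - U1 = 1/p2: it depends on p2 only, but B @ [U] cannot see that.
B = np.array([[-1.0, 1.0]])
naive_lo, naive_hi = interval_matvec(B, U_lo, U_hi)

# Exact range obtained from monotonicity in p2.
exact_lo, exact_hi = 1.0 / 4.0, 1.0 / 2.0

overestimation = 100.0 * (1.0 - (exact_hi - exact_lo) / (naive_hi[0] - naive_lo[0]))
print(naive_lo[0], naive_hi[0], overestimation)
```

Even though the exact hull of the displacements is used as input, the naive product is much wider than the exact range, because the common dependence of $U_1$ and $U_2$ on $p_1$ is lost.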
3 Numerical Example



We consider an elastic material model based on the following assumptions: small
strain theory is applied to describe the deformation of the material; the material
deforms elastically; material properties are isotropic; temperature, creep and
time-dependent effects are not taken into account; the elastic moduli in the material
properties are considered to be uncertain and varying within prescribed bounds.
A typical situation is, for example, the prediction of soil behavior, where such
properties are uncertain due to environmental changes, experimental procedure
and measurement errors. Let $V$ be a representative volume and $S$ be the boundary of this
volume, $S = S_u \cup S_\sigma$, where $S_u$ is a boundary part with prescribed displacements,
and $S_\sigma$ is a boundary part with prescribed stresses. The governing equations are
as follows:



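Kinematics:        $\varepsilon_{ij} = \frac{1}{2}(u_{i,j} + u_{j,i})$   in $V$
Equilibrium:       $\sigma_{ij,j} + b_i = 0$   in $V$
Constitutive law:  $d\sigma_{ij} = D_{ijkl}\, d\varepsilon_{kl}$   in $V$        (9)
Boundary cond.:    $u_i = \bar{u}_i$ on $S_u$,   $\sigma_{ij} n_j = T_i$ on $S_\sigma$ ,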





where $u_i$, $i = 1,2,3$, are the displacements, $\varepsilon_{ij}$ is the strain tensor, $b_i$ is the
body force, $\sigma_{ij}$ is the stress tensor, $D_{ijkl}$ is the elastic tensor, $\bar{u}_i$ and $T_i$ are
the prescribed displacements and distributed load, respectively, and $n = \{n_1, n_2, n_3\}$ is the
outward normal direction, with $n_i$ the corresponding directional cosines.

The equilibrium equation (9), relating the stress vector $\{\sigma\}$ to the body force
$\{b\}$ and the boundary force specified at the boundary $S_\sigma$ of $V$, is formulated in
terms of the unknown displacement vector $\{u\}$. Using the principle of virtual
work, the general equilibrium statement can be written in a variational form [7]
as:



$\int_V \{\delta\varepsilon\}^T \{\sigma\}\, dV - \int_V \{\delta u\}^T \{b\}\, dV - \int_{S_\sigma} \{\delta u\}^T \{T\}\, dS = 0$   (10)

for a virtual displacement vector $\{\delta u\}$.

The resulting system of equations is created using mathematical formulations
for FEM based on a variational technique. A typical FE in the 2D plane strain
case is defined by nodes $i$, $j$, $k$. Let the displacement $\{u\}$ at any point within the
element be approximated by linear shape functions and the nodal unknown
displacements $\{u_e\}$. With the displacements known at all points within an element,
the strains $\{\varepsilon\}$ and stresses $\{\sigma\}$ at any point of the FE can be determined from the nodal
unknowns. These relationships can be written in matrix notation [8] as:

$\{u\} = N \cdot \{u_e\}, \qquad \{\varepsilon\} = L \cdot N \cdot \{u_e\} = B \cdot \{u_e\}$ ,   (11)

$\{\sigma\} = D(p_e) \cdot \{\varepsilon\} = D(p_e) \cdot B \cdot \{u_e\}$ ,   (12)

where $N$ is a matrix of shape functions, $L$ is a matrix of suitable differential
operators, $D(p_e)$ is the elastic matrix involving the elastic modulus and Poisson's
ratio, and $p_e$ denotes a parameter (or parameter vector), related to the material
properties, to be considered as uncertain in the subsequent analysis.

FE    Young's modulus     Poisson's ratio
1     $E_1$ = 200 GPa     $\nu_1$ = 0.25
2     $E_2$ = 205 GPa     $\nu_2$ = 0.26
3     $E_3$ = 210 GPa     $\nu_3$ = 0.27
4     $E_4$ = 220 GPa     $\nu_4$ = 0.28

Fig. 1. 2D plane strain FEM model with 4 finite elements, where a = 1 m, q =
1000 kN/m^2, and each FE has a different material property

Substituting Eqs. (11) and (12) into Eq. (10) and taking into account that Eq.
(10) is valid for any virtual displacement, for an element $e$ we obtain a system
of parametric linear equations

$K^e(p_e) \cdot U_e = F^e$   (13)

where $K^e(p_e) = \int_V B^T D(p_e) B\, dV$ is the element stiffness matrix, which is a
function of the uncertain material properties, and $F^e = \int_V N^T \{b\}\, dV + \int_{S_\sigma} N^T \{T\}\, dS$ is
the element load vector, which could also depend on some uncertain parameters. The
final parametric global stiffness matrix and global load vector based on the FEM
matrices (13) are obtained by using an element-by-element technique [8].
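
How an element-by-element assembly leads to an affine-linear dependence of the global stiffness matrix on the element moduli can be sketched in Python. The sketch deliberately swaps the plane strain triangles of the paper for a hypothetical 1D bar chain with two-node elements; the helper names `unit_element_matrices` and `assemble` are assumptions introduced only for this illustration.

```python
import numpy as np

def unit_element_matrices(n_elem, area=1.0, length=1.0):
    """Global-size stiffness contribution of each bar element for unit modulus."""
    n_nodes = n_elem + 1
    k_local = (area / length) * np.array([[1.0, -1.0], [-1.0, 1.0]])
    mats = []
    for e in range(n_elem):
        Ke = np.zeros((n_nodes, n_nodes))
        dofs = [e, e + 1]                  # element e connects nodes e and e+1
        Ke[np.ix_(dofs, dofs)] += k_local  # scatter the local matrix
        mats.append(Ke)
    return mats

def assemble(E, mats):
    """K(E) = sum_e E[e] * K_e: each modulus enters the global matrix linearly."""
    return sum(Ee * Ke for Ee, Ke in zip(E, mats))

mats = unit_element_matrices(2)
K = assemble([2.0, 3.0], mats)
print(K)
```

Because every modulus multiplies a fixed matrix, the same parameter appears in several entries of K, which is exactly the kind of dependency a non-parametric interval matrix ignores.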

We consider the 2D plane strain example presented in Fig. 1. The model is
discretized with four triangular FEs, and the kinematic boundary conditions are
shown in the figure. The corresponding material and geometrical data are
also given therein. It is assumed that in each FE the Young's modulus, $E_i$, is an
independent uncertain parameter varying within an interval.

Three different parametric interval linear systems (1) are considered that
involve: a) four uncertain parameters: Young's modulus for each FE of the
model is uncertain and varies independently in an interval $[p_i] = [E_i - \Delta, E_i + \Delta]$,
$i = 1, \dots, 4$; b) two uncertain parameters: $E_1 = E_2$ varying in $[p_1] = [E_1 - \Delta, E_1 + \Delta]$
and $E_3 = E_4$ varying in $[p_2] = [E_3 - \Delta, E_3 + \Delta]$; c) one parameter:
assuming that the four FEs of the model have the same elastic modulus, $E$, varying
in an interval $[p] = [E - \Delta, E + \Delta]$, where $\Delta$ is the degree of uncertainty measured
in % of the corresponding nominal value. The three parametric systems are
solved for different values of the tolerance $\Delta$ = 10%, 5%, 1%, and 0.1%.
Table 1. Percentage by which Rump's method overestimates the exact hull of the PSS
to (1) with varying number of parameters and varying parameter tolerances

        6th DOF                8th DOF                10th DOF               12th DOF
        parameters:            parameters:            parameters:            parameters:
Δ       4      2      1        4      2      1        4      2      1        4      2      1
10%     15.87  14.54  9.16     19.45  14.50  9.16     20.01  19.20  9.16     27.65  19.19  9.16
5%       7.93   7.36  4.79      9.88   7.36  4.79     10.14   9.81  4.79     14.38   9.81  4.79
1%       1.60   1.50  0.99      2.02   1.50  0.99      2.07   2.01  0.99      2.99   2.01  0.99
0.1%     0.16   0.15  0.10      0.20   0.15  0.10      0.21   0.20  0.10      0.30   0.20  0.10


Table 2. For the model with 4 uncertain parameters, the percentage by which a
straightforward interval evaluation of (3) for the 4th FE based on: (R) Rump's
enclosure; (M) exact hull of $U(p)$, overestimates the exact range of the element strains.
(M/R) presents the ratio in % between the two strain estimations

        Δ = 10%           Δ = 5%            Δ = 1%            Δ = 0.1%
        component         component         component         component
        2       3         2       3         2       3         2       3
R       99.40   98.15     99.30   97.75     99.22   97.41     99.20   97.32
M       99.21   97.56     99.20   97.44     99.19   97.34     99.19   97.32
M/R     24.97   24.02     12.85   12.31      2.65    2.53      0.27    0.25



A Mathematica package IntervalComputations`LinearSystems` [2],
providing a variety of functions for solving parametric and nonparametric interval
linear systems in validated interval arithmetic, and supporting the hybrid
interval approach, is used for the computations. Due to the boundary conditions,
only the 6th, 8th, 10th, and 12th degree of freedom (DOF) of $U(p)$ are nonzero.
We compare the results obtained by the parametric iteration method to the
exact interval hull of the displacements. The overestimation of Rump's enclosure
is measured in % as $100\,(1 - \omega(\square\Sigma^p) / \omega([U]))$, where $\square\Sigma^p$ is the exact hull of
the corresponding parametric solution set, $[U]$ is the interval enclosure obtained
by the parametric iteration method, and $\omega(\cdot)$ is the width of an interval,
defined as $\omega([a^-, a^+]) = a^+ - a^-$. The percentage by which the iterative solution
enclosure overestimates the exact hull of the corresponding parametric solution
set is presented in Table 1. The computationally efficient iterative procedure
gives sharp enclosures for a small number of parameters and thin intervals for the
uncertainties. However, the overestimation grows as the number of
parameters or the width of the intervals increases.
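
The overestimation measure used for Tables 1 and 2 is simple enough to state as a short Python sketch. The function names and the sample intervals are hypothetical and are not taken from the tables; the containment check is an added safeguard, not part of the measure itself.

```python
import math

def width(interval):
    """Width of an interval [a_minus, a_plus]."""
    a_minus, a_plus = interval
    return a_plus - a_minus

def overestimation_percent(hull, enclosure):
    """100 * (1 - width(hull) / width(enclosure)) for an enclosure containing the hull."""
    if not (enclosure[0] <= hull[0] and hull[1] <= enclosure[1]):
        raise ValueError("the enclosure must contain the hull")
    return 100.0 * (1.0 - width(hull) / width(enclosure))

# Hypothetical numbers: exact hull [1, 2] enclosed by [0.9, 2.1].
value = overestimation_percent((1.0, 2.0), (0.9, 2.1))
print(round(value, 2))
```

A value of 0% means the enclosure coincides with the hull, while values close to 100% mean the enclosure is far wider than the exact range.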

For the model with 4 uncertain parameters, a straightforward interval
evaluation of (3) and (4) was done based on interval estimations for the element
displacements. Two interval estimations for the element displacements were used:
the interval enclosure obtained by the parametric iteration method, and the exact
interval hull. The results for the element strains are presented in Table 2. The
results for the first component of the strains are not given since the component
is zero. A huge overestimation of the strains range is demonstrated, which is due
to the parameter dependencies between the components of $U_e$. The dependency
accumulated in the interval enclosure $[U_e]$ cannot be taken into account by the
interval computation $B \cdot [U_e]$. It is remarkable that the difference in the interval
estimations of $U$ (by Rump's method and the exact hull) is obscured by the
inherited dependencies between the displacement components.

The naive range estimation of the element stresses showed similar results.
Note that the expression (4) involves additional dependencies due to the param- 
eters in the matrix D{pe). Accounting for the dependencies in D{pe) reduces the 
overestimation by 30-50%. 

Since the considered model is small, the exact interval hulls and the ex- 
act ranges were computed by symbolic-algebraic methods in the environment of 
Mathematica. Our hybrid approach produced very sharp bounds for both the so- 
lution set of the displacements system and the ranges of the response parameters. 
The difference to the corresponding exact values was within 10“^®-10“^®. 

4 Conclusions 

The interval analysis methods, presented in this paper, can be applied to all 
engineering problems whose behavior is governed by parametric interval linear 
systems and response quantities of type (1), (2). The hybrid interval approach, 
proposed in Sec. 2, is based on numerical proof of monotonicity properties and 
on a fast iteration method for enclosing PSS. By using validated interval compu- 
tations, this approach has the additional property that, owing to an automatic 
error control mechanism, every computed result is guaranteed to be correct. 

The Mathematica package [2] should be extended by suitable functions au- 
tomating the computations of the response quantities. Further research toward 
improving the computational efficiency of the method would allow large indus- 
trial applications. 
Abstract. This paper deals with a set-valued version of the two-stage
Runge-Kutta method with coefficients satisfying certain conditions.
Applying it to a differential inclusion whose right-hand side is a strongly
convex set-valued map, we obtain third order local accuracy of the
approximate reachable sets.



1 Introduction 

We consider a differential inclusion 

$\dot{x} \in F(x, t), \quad x(t_0) \in X_0, \quad t \in [t_0, T]$ ,   (1)

where $x \in \mathbb{R}^n$, $F(x, t)$ is a set-valued map $F : \mathbb{R}^n \times \mathbb{R} \rightrightarrows \mathbb{R}^n$, $X_0 \subset \mathbb{R}^n$.
An absolutely continuous function $x(t)$, which satisfies (1) for a.e. $t \in [t_0, T]$, is
called a solution of (1).

Different discretization methods for differential inclusions are available [1,2].
Most of the approaches are directly inspired by similar schemes for differential
equations. In this paper we consider a set-valued version of the explicit two-stage
Runge-Kutta method

$x_{k+1} \in x_k + h\{b_1 y + b_2 z \mid y \in F(x_k, t_k + c_1 h),\ z \in F(x_k + a h y, t_k + c_2 h)\}$ ,
$x_0 \in X_0, \quad t_k = t_0 + kh, \quad k = 0, 1, \dots, N - 1$ ,   (2)

where $a$, $b_1$, $b_2$, $c_1$, $c_2$ are real coefficients, $h = (T - t_0)/N$.

We introduce some notations and definitions. Consider the differential
inclusion (1) with $F$ defined in a neighborhood of a compact convex set $S \times \Delta \subset
\mathbb{R}^n \times \mathbb{R}$, $\mathrm{Int}\, S \ne \emptyset$. The reachable set $R(X; t_1, t_2)$ is the set of all points $x$,
for which there is a solution $x(\cdot)$ of (1) on $[t_1, t_2] \subset \Delta$, starting from $X \subset S$
($x(t_1) \in X$), such that $x(t_2) = x$. The discrete-time reachable set $\tilde{R}(x; t, t + h)$
is defined as

$\tilde{R}(x; t, t + h) = \{x + h(b_1 y + b_2 z) \mid y \in F(x, t + c_1 h),\ z \in F(x + a h y, t + c_2 h)\}$ .   (3)



I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 270-275, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
The Hausdorff distance between two convex and compact sets $Y_1 \subset \mathbb{R}^n$,
$Y_2 \subset \mathbb{R}^n$ is given by

$H(Y_1, Y_2) = \max_{|l| = 1} |\rho(l|Y_1) - \rho(l|Y_2)|$ ,

where $\rho(l|Y)$ stands for the support function of the compact set $Y$:
$\rho(l|Y) = \max_{y \in Y} \langle l, y \rangle$; $\langle \cdot, \cdot \rangle$ is the scalar product in $\mathbb{R}^n$. The set $Y \subset \mathbb{R}^n$ is called strongly
($\mu$-strongly) convex, if there exists $\mu > 0$, such that $y_1 \in Y$, $y_2 \in Y$ imply

$\frac{1}{2}(y_1 + y_2) + \mu |y_1 - y_2|^2\, l \in Y$ for every $|l| \le 1$ .

If $Y$ is strongly convex, and $l \ne 0$, then there is a unique element $y(l, Y) \in Y$,
such that $\langle l, y(l, Y) \rangle = \rho(l|Y)$. We shall also use the notations $r(l, x, t) =
\rho(l|F(x, t))$, $\mathcal{L} = \{l \in \mathbb{R}^n \mid 0.5 \le |l| \le 2\}$.
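
The support-function formula for the Hausdorff distance can be tried out numerically. The Python sketch below is only an illustration under stated assumptions: the sets are planar convex polygons given by hypothetical vertex lists, and the maximum over unit directions is replaced by a maximum over finitely many sampled directions, which in general yields a lower estimate of the true distance.

```python
import numpy as np

def support(points, l):
    """rho(l | co(points)): the maximum of <l, y> over the given points."""
    return float(np.max(np.asarray(points, float) @ l))

def hausdorff_convex(points1, points2, n_dirs=360):
    """Approximate max over |l| = 1 of |rho(l|Y1) - rho(l|Y2)| by sampling directions."""
    angles = 2.0 * np.pi * np.arange(n_dirs) / n_dirs
    directions = np.column_stack([np.cos(angles), np.sin(angles)])
    return max(abs(support(points1, l) - support(points2, l)) for l in directions)

# Hypothetical example: a square and the same square shifted by 0.5 along x.
square = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
shifted = [(x + 0.5, y) for x, y in square]
distance = hausdorff_convex(square, shifted)
print(distance)
```

For two translates of the same convex set the support functions differ by the linear term $\langle l, d \rangle$ with $d$ the shift, so the distance equals $|d|$, here 0.5.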

Following the paper of Veliov [3], where under the assumptions (see below)
A1-A3 (resp. A1-A3 and B) an estimate $O(h^{2.5})$ (resp. $O(h^3)$) for the local
error of the Heun scheme ($c_1 = 0$, $a = c_2 = 1$, $b_1 = b_2 = \frac{1}{2}$) is obtained, we
consider the Runge-Kutta method (2) with coefficients satisfying the conditions

$a = \frac{1 - c_1}{1 - 2c_1}, \quad b_1 = \frac{1}{2(1 - c_1)}, \quad b_2 = \frac{1 - 2c_1}{2(1 - c_1)}, \quad 0 \le c_1 < \frac{1}{2}, \quad c_2 = 1$ .   (4)

Without any additional suppositions we achieve the same order of accuracy of
the approximate reachable set (3).
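
The conditions (4) and one step of the set-valued scheme can be explored with a short Python sketch. It is an illustration under assumptions that are not part of the paper: the scalar autonomous inclusion $\dot{x} \in -x + [-1, 1]$, the brute-force sampling of selections, and the function names are all hypothetical. For this test problem the reachable set from $x_0$ after time $h$ is the interval $x_0 e^{-h} \pm (1 - e^{-h})$.

```python
import math
import numpy as np

def rk_coefficients(c1):
    """Coefficients (a, b1, b2, c2) of the two-stage method for 0 <= c1 < 1/2."""
    if not 0.0 <= c1 < 0.5:
        raise ValueError("c1 must satisfy 0 <= c1 < 1/2")
    a = (1.0 - c1) / (1.0 - 2.0 * c1)
    b1 = 1.0 / (2.0 * (1.0 - c1))
    b2 = (1.0 - 2.0 * c1) / (2.0 * (1.0 - c1))
    return a, b1, b2, 1.0

def discrete_reach_interval(x, h, c1, samples=21):
    """One set-valued step for F(x, t) = -x + [-1, 1], sampled over selections.

    F is autonomous here, so the stage times t + c1*h and t + c2*h play no role.
    """
    a, b1, b2, _c2 = rk_coefficients(c1)
    controls = np.linspace(-1.0, 1.0, samples)
    points = []
    for u in controls:
        y = -x + u                      # y in F(x, t + c1 h)
        for v in controls:
            z = -(x + a * h * y) + v    # z in F(x + a h y, t + c2 h)
            points.append(x + h * (b1 * y + b2 * z))
    return min(points), max(points)

h = 0.1
lo, hi = discrete_reach_interval(0.0, h, c1=0.25)
exact_hi = 1.0 - math.exp(-h)
print(hi, exact_hi, abs(hi - exact_hi))
```

With $c_1 = 0$ the coefficients reduce to the Heun scheme mentioned above, and for any admissible $c_1$ they satisfy $b_1 + b_2 = 1$ and $b_1 c_1 + b_2 c_2 = b_2 a = 1/2$.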



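A1. $F$ is compact valued and there is $\mu > 0$, such that $F(x, t)$ is $\mu$-strongly
convex for every $x \in S$ and $t \in \Delta$.

A2. $r(l, x, t)$ is differentiable with respect to $x$ and $t$, $\frac{\partial r}{\partial x}$ is Lipschitz continuous
in $(l, x, t) \in \mathcal{L} \times S \times \Delta$, and $\frac{\partial r}{\partial t}$ is Lipschitz continuous in $(l, t) \in \mathcal{L} \times \Delta$, uniformly
in $x \in S$.

A3. $X_0$ is compact, $[t_0, T] \subset \mathrm{Int}\, \Delta$, and $R(X_0; t_0, t) \subset \mathrm{Int}\, S$ for every $t \in [t_0, T]$.

B. $y(l, F(x, t))$ is Lipschitz continuous in $t \in \Delta$, uniformly in $(l, x) \in \mathcal{L} \times S$.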



2 Main Result 

Theorem. Let the assumptions A1-A3 and the conditions (4) hold. Then for a
given compact set $S_0 \subset \mathrm{Int}\, S$ and $[t_1, t_2] \subset \mathrm{Int}\, \Delta$, there exists a constant $h_0 > 0$,
such that

$H(R(x_0; t, t + h), \tilde{R}(x_0; t, t + h)) = O(h^{2.5})$   (5)

for every $x_0 \in S_0$, $t \in [t_1, t_2]$, and $h \in (0, h_0]$. If, in addition, the assumption B
holds, then $h^{2.5}$ in (5) can be replaced with $h^3$.
We need the following two lemmas.

Lemma 1. Let A1-A3 hold. Then given $S_0 \subset \mathrm{Int}\, S$ and $[t_1, t_2] \subset \mathrm{Int}\, \Delta$, there
exists a constant $h_0 > 0$, such that

$H(\mathrm{co}\, \tilde{R}(x_0; t, t + h), \tilde{R}(x_0; t, t + h)) = O(h^3)$

for every $x_0 \in S_0$, $t \in [t_1, t_2]$, and $h \in (0, h_0]$.

Lemma 2. Let A1-A3 and (4) be fulfilled. Then for a given compact set $S_0 \subset
\mathrm{Int}\, S$ and $[t_1, t_2] \subset \mathrm{Int}\, \Delta$, there exists a constant $h_0 > 0$, such that

$|\rho(l|R(x_0; t, t + h)) - \rho(l|\tilde{R}(x_0; t, t + h))| = O(h^{2.5}), \quad |l| = 1$   (6)

for every $x_0 \in S_0$, $t \in [t_1, t_2]$, and $h \in (0, h_0]$. If, in addition, the assumption B
holds, then $h^{2.5}$ in (6) can be replaced with $h^3$.

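Proof (of Lemma 1). The assumptions A1-A3 imply Lipschitz continuity of $F$,
hence the set $\bigcup \{F(x, t) \mid (x, t) \in S \times \Delta\}$ is bounded.

Let $u \in \mathrm{co}\, \tilde{R}(x_0; t, t + h)$ be fixed. There exist $y_i \in F(x_0, t + c_1 h)$, $z_i \in
F(x_0 + a h y_i, t + c_2 h)$, $i = 1, 2, \dots, n + 1$, such that

$u = x_0 + h \sum_{i=1}^{n+1} \alpha_i (b_1 y_i + b_2 z_i)$ ,

where $\alpha_i \in [0, 1]$, $\sum_{i=1}^{n+1} \alpha_i = 1$. Denote $\bar{y} = \sum_{i=1}^{n+1} \alpha_i y_i$,

$\hat{R} = x_0 + h[b_1 \bar{y} + b_2 F(x_0 + a h \bar{y}, t + c_2 h)]$ ,

$\Xi = x_0 + h \left[ b_1 \bar{y} + b_2 \sum_{i=1}^{n+1} \alpha_i F(x_0 + a h y_i, t + c_2 h) \right]$ .

Obviously $\hat{R}$ and $\Xi$ are convex sets. Since $\hat{R} \subset \tilde{R}(x_0; t, t + h)$ and $u \in \Xi$,
the proof of Lemma 1 is obtained by estimating in succession

$\mathrm{dist}(u, \tilde{R}(x_0; t, t + h)) \le \max_{|l|=1} \{\rho(l|\Xi) - \rho(l|\hat{R})\}$

$= b_2 h \max_{|l|=1} \left\{ \sum_{i=1}^{n+1} \alpha_i r(l, x_0 + a h y_i, t + c_2 h) - r(l, x_0 + a h \bar{y}, t + c_2 h) \right\}$

$= b_2 h \max_{|l|=1} \left\{ \sum_{i=1}^{n+1} \alpha_i [r(l, x_0 + a h y_i, t + c_2 h) - r(l, x_0 + a h \bar{y}, t + c_2 h)] \right\}$

$= b_2 h \max_{|l|=1} \left\{ \sum_{i=1}^{n+1} \alpha_i \left\langle \frac{\partial r}{\partial x}(l, x_0, t + c_2 h),\ a h (y_i - \bar{y}) \right\rangle + O(h^2) \right\}$

$= O(h^3)$ ,

where the first-order term vanishes because $\sum_{i=1}^{n+1} \alpha_i (y_i - \bar{y}) = 0$.   □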
Proof (of Lemma 2). Assumption A2 implies that given $S_0 \subset \mathrm{Int}\, S$ and $[t_1, t_2]
\subset \mathrm{Int}\, \Delta$, there is $h_0 > 0$, such that the equation

$\dot{l}(s) = -\frac{\partial r}{\partial x}(l(s), x_0, s), \quad l(t + h) = l, \quad |l| = 1$

has a unique solution $l(s)$ on $[t, t + h]$ for every $x_0 \in S_0$, $t \in [t_1, t_2]$, and
$h \in (0, h_0]$, satisfying $l(s) \in \mathcal{L}$.

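Using $b_2 > 0$, successively we obtain

$\rho(l|\tilde{R}(x_0; t, t + h))$

$= \langle l, x_0 \rangle + h \max_{y \in F(x_0, t + c_1 h)}\ \max_{z \in F(x_0 + a h y, t + c_2 h)} \langle l, b_1 y + b_2 z \rangle$

$= \langle l, x_0 \rangle + h \max_{y \in F(x_0, t + c_1 h)} \{ b_1 \langle l, y \rangle + b_2 r(l, x_0 + a h y, t + c_2 h) \}$

$= \langle l, x_0 \rangle + h \max_{y \in F(x_0, t + c_1 h)} \left\{ b_1 \langle l, y \rangle + b_2 r(l, x_0, t + c_2 h) + b_2 a h \left\langle \frac{\partial r}{\partial x}(l, x_0, t + c_2 h), y \right\rangle \right\} + O(h^3)$

$= \langle l, x_0 \rangle + h \left[ r\left( b_1 l + b_2 a h \frac{\partial r}{\partial x}(l, x_0, t + c_2 h),\ x_0,\ t + c_1 h \right) + b_2 r(l, x_0, t + c_2 h) \right] + O(h^3)$

$= \langle l, x_0 \rangle + h \left[ r\left( b_1 l(t + h) + b_2 a h \frac{\partial r}{\partial x}(l(t + h), x_0, t + c_2 h),\ x_0,\ t + c_1 h \right) + b_2 r(l(t + h), x_0, t + c_2 h) \right] + O(h^3)$ .

Taking into account that, under the special choice (4) of the coefficients, the
estimate

$b_1 l(t + h) + b_2 a h \frac{\partial r}{\partial x}(l(t + h), x_0, t + c_2 h)$

$= b_1 l(t + c_1 h) + b_1 (1 - c_1) h\, \dot{l}(t + c_1 h) + b_2 a h \frac{\partial r}{\partial x}(l(t + c_1 h), x_0, t + c_1 h) + O(h^2)$

$= \frac{1}{2(1 - c_1)}\, l(t + c_1 h) + O(h^2)$

holds, we arrive at

$\rho(l|\tilde{R}(x_0; t, t + h)) = \langle l, x_0 \rangle + \frac{h}{2(1 - c_1)} \left[ r(l(t + c_1 h), x_0, t + c_1 h) + (1 - 2c_1)\, r(l(t + h), x_0, t + h) \right] + O(h^3)$ .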



It is proven in [3] (Lemma 4, formula (16)), that on the assumptions A2 and
A3, for every compact $S_0 \subset \mathrm{Int}\, S$ and $[t_1, t_2] \subset \mathrm{Int}\, \Delta$, there exists $h_0 > 0$, such
that

$\rho(l|R(x_0; t, t + h)) = \langle l, x_0 \rangle + \int_t^{t+h} r(l(s), x_0, s)\, ds + O(h^3)$

for every $x_0 \in S_0$, $t \in [t_1, t_2]$, $h \in (0, h_0]$, and $|l| = 1$.

In the remainder of the proof we use the following trapezoidal formula for
numerical integration of $\varphi(s) = r(l(s), x_0, s)$

$\int_t^{t+h} r(l(s), x_0, s)\, ds = \frac{h}{2(1 - c_1)} \left[ r(l(t + c_1 h), x_0, t + c_1 h) + (1 - 2c_1)\, r(l(t + h), x_0, t + h) \right] + O(h^{2.5})$ (resp. $O(h^3)$)

uniformly in $l$, $x_0 \in S_0$, and $t \in [t_1, t_2]$. The latter is obtained in a standard way
by linearly interpolating $\varphi(s)$ at the points $(t + c_1 h, \varphi(t + c_1 h))$, $(t + h, \varphi(t + h))$
and considering the $\frac{1}{2}$-Hölder (resp. Lipschitz) continuity of $\dot{\varphi}(s)$. As is seen
from (see [3])

$\dot{\varphi}(s) = \left\langle \frac{\partial r}{\partial l}(l(s), x_0, s),\ \dot{l}(s) \right\rangle + \frac{\partial r}{\partial s}(l(s), x_0, s)$

$= -\left\langle y(l(s), F(x_0, s)),\ \frac{\partial r}{\partial x}(l(s), x_0, s) \right\rangle + \frac{\partial r}{\partial s}(l(s), x_0, s)$

(since $F(x_0, s)$ is a strongly convex set, and thus implying uniqueness of $y(l(s),
F(x_0, s)) \in F(x_0, s)$), the corresponding continuity properties of $\dot{\varphi}(s)$ come as
a consequence of the continuous differentiability of $l(s)$, A2, and another result
proven in [3] (Lemma 3 in [3]): The mapping $(l, x, t) \mapsto y(l, F(x, t))$ is Lipschitz
continuous in $l$ uniformly in $(x, t)$, and $\frac{1}{2}$-Hölder continuous in $(x, t)$ uniformly
in $l$, when $l \in \mathcal{L}$, $x \in S$ and $t \in \Delta$.

($\frac{1}{2}$-Hölder continuity means that there is a constant $L$ such that $|y(l, F(x_1, t_1)) -
y(l, F(x_2, t_2))| \le L(|x_1 - x_2| + |t_1 - t_2|)^{1/2}$ for every $t_1$, $t_2$, $x_1$, $x_2$, and $l$ as above.)

Thus, by applying Lemma 3 from [3] (resp. Lemma 3 from [3] in combination 
with the assumption B) we complete the proof. □ 
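
The two-point quadrature rule used in the proof can be checked numerically with a short Python sketch. The integrand, the step sizes and the function name `two_point_rule` are hypothetical choices made for illustration; the integrand is smooth, so the sketch only illustrates the $O(h^3)$ behaviour of the Lipschitz case, not the $\frac{1}{2}$-Hölder case.

```python
import math

def two_point_rule(phi, t, h, c1):
    """h / (2 (1 - c1)) * [phi(t + c1 h) + (1 - 2 c1) phi(t + h)].

    The weights integrate exactly the linear interpolant of phi through the
    nodes t + c1*h and t + h over the interval [t, t + h].
    """
    weight = h / (2.0 * (1.0 - c1))
    return weight * (phi(t + c1 * h) + (1.0 - 2.0 * c1) * phi(t + h))

def quadrature_error(h, c1=0.25, t=0.0):
    exact = math.exp(t + h) - math.exp(t)   # integral of exp over [t, t + h]
    return abs(exact - two_point_rule(math.exp, t, h, c1))

# Halving the step should reduce the local error by roughly 2**3 = 8.
ratio = quadrature_error(0.1) / quadrature_error(0.05)
print(ratio)
```

The weights follow from requiring exactness for constant and linear integrands: they sum to $h$, and the weighted nodes reproduce $\int_t^{t+h} (s - t)\, ds = h^2/2$.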

Proof (of the Theorem). It follows from Lemma 2 that

$H(\mathrm{co}\, R(x_0; t, t + h), \mathrm{co}\, \tilde{R}(x_0; t, t + h)) = O(h^{2.5})$ (resp. $O(h^3)$) .

Making use of Lemma 1 and the estimate (see [3], Lemma 2, formula (12))

$H(R(x_0; t, t + h), \mathrm{co}\, R(x_0; t, t + h)) = O(h^3)$ ,

by utilizing the triangle inequality

$H(R, \tilde{R}) \le H(R, \mathrm{co}\, R) + H(\mathrm{co}\, R, \mathrm{co}\, \tilde{R}) + H(\mathrm{co}\, \tilde{R}, \tilde{R})$ ,

we complete the proof. □
Abstract. To describe the response of complex engineering systems
to various damage mechanics, engineers have traditionally used
number-valued utilities to describe the results of different possible outcomes,
and (number-valued) probabilities (often, subjective probabilities) to
describe the relative frequency of different outcomes. This description is
based on the assumption that experts can always make a definite
preference between two possible outcomes, i.e., that the set of all outcomes
is linearly (totally) ordered. In practice, experts often cannot make a
choice; their preference is only a partial order. In this paper, we describe
a new approach based on partial order.



1 Introduction 

To describe the response of complex engineering systems to various damage
mechanics, engineers have traditionally used number-valued utilities to describe the
results of different possible outcomes, and (number-valued) probabilities (often,
subjective probabilities) to describe the relative frequency of different outcomes.
This description is based on the assumption that experts can always make a
definite preference between two possible outcomes, i.e., that the set of all outcomes
is linearly (totally) ordered.

In practice, experts often cannot make a choice; their preference is only a
partial order, see, e.g., [1]. For example, one of the main criteria for a tank design
is that the tank retain most of its functionality after a direct hit. It is, however,
difficult to describe the remaining functionality (utility) by a single numerical
value. Some designs place more protection on the tank's weapons; so, when a
tank is hit, it will retain most of its shooting capabilities, but its movement
abilities may be severely damaged. In other designs, there is more protection
on the engine and on the tracks, so the tank may lose its shooting abilities but
keep its ability to move fast. It is difficult to make a definite selection because,
depending on the battlefield situation, different designs will work best: in active
defense, an immobilized tank is still a valuable shooting force (and thus of much
larger value than a moving tank with no shooting capabilities), while in a fast
long-distance attack, an immobilized tank is practically useless.

In such situations, when we use traditional totally-ordered techniques, we
thus force an expert to make a more or less arbitrary choice between two
difficult-to-compare outcomes; different choices lead to different numerical values of
utilities and probabilities and, as a result, to different decisions.

It is therefore desirable to come up with a decision-making procedure that 
would be robust, i.e., that would only depend on the actual expert choices. 

The main reason why the traditional number-valued approach to decision 
making is widely used is that this approach is based on a solid foundation: there 
are axioms, principles that — if true — uniquely lead to probabilities, utilities, and 
corresponding techniques for decision making. Most of these principles are pretty 
reasonable, with the exception of one: that the corresponding ordering of alter- 
natives is “total” (“linear”). Traditional decision theory (see, e.g., [2, 3]) is based 
on the assumption that a person whose preferences we want to describe can
always (linearly) order his preferences, i.e., that for every two alternatives a and
a', he can decide:

— whether a is better than a' (we will denote it by a' -< a); 

— or whether a' is better than a {a -< a'); 

— or whether a and a' are (for this person) of the same quality (we will denote 

it by a ~ a'). 

A similar assumption (often implicit) underlies the traditional description of 
degrees of belief (“subjective probabilities”) by numbers from the interval [0,1]. 

As we have mentioned, in real life, an expert may not always be able to
compare two different alternatives. In this paper, we provide an exact description
of decision making under partial ordering of alternatives. It turns out that in
general, the uncertainty of each situation is characterized not by a scalar linearly
ordered quantity (probability), but by a matrix-type partially ordered quantity
(ordered operator).

Important particular cases are interval-valued probabilities and more general
algebraic structures described by S. Markov and his group; see, e.g., [7,8].

Our results were partially published in [5] and [6]. 

2 Traditional Utility Theory: A Brief Reminder 

In this section, we will mainly follow standard definitions (see, e.g., [2,3]), but 
we will not always follow them exactly: in some cases, we will slightly rephrase 
these definitions (without changing their mathematical contents) so as to make 
the following transition to partially ordered preferences as clear as possible. 
Definition 1. Let $\mathcal{A}$ be a set; this set will be called the set of alternatives (or
the set of pure alternatives). By a lottery on $\mathcal{A}$ we understand a probability
measure on $\mathcal{A}$ with finite support.

In other words, a lottery is a pair $(A, p)$, where $A = \{a_1, \dots, a_n\} \subset \mathcal{A}$ is a
finite subset of $\mathcal{A}$, and $p$ is a mapping $p : A \to [0,1]$ for which $p(a_i) \ge 0$ and
$\sum p(a_i) = 1$. A lottery will also be denoted as $p(a_1) \cdot a_1 + \dots + p(a_n) \cdot a_n$. We do not
consider lotteries with infinite numbers of alternatives, because every real-life
randomizing device, be it a die or a computer-based random number generator,
produces only finitely many possibilities.

The set of lotteries will be denoted by $L$. On this set $L$, we can naturally
define an operation of probability combination as a convex combination of the
corresponding probability measures: namely, if we have $m$ values $q_1, \dots, q_m \in
[0,1]$ with $\sum q_j = 1$, and $m$ lotteries $\ell_j = (A_j, p_j)$, then we can define the
probability combination $\ell = q_1 \cdot \ell_1 + \dots + q_m \cdot \ell_m$ as a lottery $\ell = (A, p)$ with
$A = \bigcup A_j$ and $p(a) = \sum q_j \cdot p_j(a)$, where the sum is taken over all $j$ for which
$a \in A_j$.
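
The probability combination of lotteries can be written down in a few lines of Python. This is only a sketch: representing a lottery as a dictionary from alternatives to probabilities, the function name `combine`, and the sample alternatives are assumptions made for illustration.

```python
import math
from collections import defaultdict

def combine(weights, lotteries):
    """Probability combination q_1 * l_1 + ... + q_m * l_m of finite lotteries.

    Each lottery is a dict {alternative: probability}; the result is supported
    on the union of the supports, with p(a) = sum of q_j * p_j(a) over all j
    whose lottery contains a.
    """
    if any(q < 0 for q in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    combined = defaultdict(float)
    for q, lottery in zip(weights, lotteries):
        for alternative, probability in lottery.items():
            combined[alternative] += q * probability
    return dict(combined)

# Hypothetical alternatives "a" and "b".
mixed = combine([0.5, 0.5], [{"a": 0.5, "b": 0.5}, {"b": 1.0}])
print(mixed)
```

The result is again a probability measure with finite support, so the set of lotteries is closed under this operation.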

Definition 2. Let $\mathcal{A}$ be a set, and let $L$ be the set of all lotteries over $\mathcal{A}$. By a
preference relation, we mean a pair $(\prec, \sim)$, where $\prec$ is a (strict) order on $L$, $\sim$
is an equivalence relation on $L$, and for every $\ell, \ell', \ell'' \in L$ and every $p \in (0, 1)$,
the following conditions hold:

1. if $\ell \sim \ell'$ and $\ell' \prec \ell''$, then $\ell \prec \ell''$;

2. if $\ell \prec \ell'$ and $\ell' \sim \ell''$, then $\ell \prec \ell''$;

3. if $\ell \prec \ell'$, then $p \cdot \ell + (1 - p) \cdot \ell'' \prec p \cdot \ell' + (1 - p) \cdot \ell''$;

4. if $p \cdot \ell + (1 - p) \cdot \ell'' \prec p \cdot \ell' + (1 - p) \cdot \ell''$, then $\ell \prec \ell'$;

5. if $\ell \sim \ell'$, then $p \cdot \ell + (1 - p) \cdot \ell'' \sim p \cdot \ell' + (1 - p) \cdot \ell''$;

6. if $p \cdot \ell + (1 - p) \cdot \ell'' \sim p \cdot \ell' + (1 - p) \cdot \ell''$, then $\ell \sim \ell'$.

Definition 3. A preference relation is called linearly ordered (or linear, for
short) if for every $\ell, \ell' \in L$, either $\ell \preceq \ell'$, or $\ell' \preceq \ell$ (where $\ell \preceq \ell'$ means that
either $\ell \prec \ell'$ or $\ell \sim \ell'$).

It is known that linearly ordered preference relations can be characterized in
terms of special functions called utility functions:

Definition 4. A function $u$ from the set $L$ of all lotteries to an ordered set $V$ is
called a utility function. For each $\ell \in L$, $u(\ell)$ will be called a value of the utility
function. We say that a utility function $u$ describes the preference relation if for
every $\ell, \ell' \in L$, the following two conditions hold:

— $\ell \prec \ell'$ if and only if $u(\ell) < u(\ell')$;

— $\ell \sim \ell'$ if and only if $u(\ell) = u(\ell')$.

Definition 5. A utility function $u : L \to V$ is called convexity-preserving if on
the set $V$, convex combination $p_1 \cdot v_1 + \dots + p_n \cdot v_n$ is defined for all $p_i \ge 0$,
$\sum p_i = 1$, and for every $p_i$ and $\ell_i$, we have $u(p_1 \cdot \ell_1 + \dots + p_m \cdot \ell_m) =
p_1 \cdot u(\ell_1) + \dots + p_m \cdot u(\ell_m)$.
To describe linearly ordered preference relations, we use scalar utility
functions, i.e., convexity-preserving utility functions for which $V = \mathbb{R}$. It is known
that for every convexity-preserving function $u : L \to \mathbb{R}$, the relations $u(\ell) < u(\ell')$
and $u(\ell) = u(\ell')$ define a linearly ordered preference relation. It is also known
that this utility function is determined uniquely modulo a linear transformation,
i.e.:

— If two different scalar utility functions $u : L \to \mathbb{R}$ and $u' : L \to \mathbb{R}$ describe the
same preference relation, then there exists a linear function $T(z) = k \cdot z + m$,
with $k > 0$, such that for every lottery $\ell$, $u'(\ell) = T(u(\ell))$.

— Vice versa, if a scalar utility function $u : L \to \mathbb{R}$ describes a preference
relation, and $k > 0$ and $m$ are real numbers, then the function $u'(\ell) =
T(u(\ell))$ (where $T(z) = k \cdot z + m$) is also a scalar utility function which
describes the same preference relation.

One can also show that every Archimedean (in some reasonable sense) linearly
ordered preference relation $(\prec, \sim)$ can be described by an appropriate scalar
utility function.
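
A scalar convexity-preserving utility and its invariance under $T(z) = k \cdot z + m$ with $k > 0$ can be illustrated by a small Python sketch. The alternatives, their utility values and the function name `expected_utility` are hypothetical; the utility of a lottery is taken to be the expected value of the utilities of its pure alternatives, which is one concrete convexity-preserving choice.

```python
import math

def expected_utility(lottery, u):
    """Utility of a lottery {alternative: probability} as an expected value."""
    return sum(probability * u[alternative] for alternative, probability in lottery.items())

# Hypothetical utilities of three pure alternatives.
u = {"a": 0.0, "b": 1.0, "c": 3.0}
k, m = 2.0, 5.0
u_prime = {alt: k * value + m for alt, value in u.items()}   # T(z) = k z + m, k > 0

lottery1 = {"a": 0.5, "c": 0.5}
lottery2 = {"b": 1.0}
mixture = {"a": 0.25, "c": 0.25, "b": 0.5}   # 0.5 * lottery1 + 0.5 * lottery2

u1, u2 = expected_utility(lottery1, u), expected_utility(lottery2, u)
v1, v2 = expected_utility(lottery1, u_prime), expected_utility(lottery2, u_prime)
print(u1, u2, v1, v2)
```

Both utility functions rank the two lotteries in the same way, and the utility of the mixture equals the corresponding convex combination of the utilities.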

3 New Approach: Utility Theory for Partially Ordered Preferences

It turns out that a similar result holds for partially ordered preferences as well.
To describe this result, we need to recall a few definitions.

An affine space (see, e.g., [4] and references therein) is “almost” a vector
space; the main difference between them is that in the linear space, there is a
fixed starting point (0), while in the affine space, there is no fixed point. More
formally:

— A linear space is defined as a set $V$ with two operations: addition $v + v'$ and
multiplication $\lambda \cdot v$ of elements from $V$ by real numbers $\lambda \in R$ (operations
which must satisfy some natural properties). With these two basic operations,
we can define an arbitrary linear combination $\lambda_1 \cdot v_1 + \ldots + \lambda_n \cdot v_n$ of elements
$v_1, \ldots, v_n \in V$.

— In the affine space, we can only define those linear combinations which are
shift-invariant, i.e., linear combinations with $\sum \lambda_i = 1$.

The relation between a linear space and an affine space is rather straightforward: 

— if we have an affine space $V$, then we can pick an arbitrary point $v_0 \in V$ and
define a linear space in which this point is 0. Namely, we can define $v + v'$
as $1 \cdot v + 1 \cdot v' - 1 \cdot v_0$: since we took $v_0$ as 0, this linear combination will be
exactly $v + v'$.

— Vice versa, if we have a hyperplane $H$ in a linear space, then (unless this
hyperplane goes through 0) this hyperplane is not a linear space, but it is
always an affine space.






Definition 6. A vector space $V$ with a strict order $<$ is called an ordered vector
space if for every $v, v', v'' \in V$, and for every real number $\lambda > 0$, the following
two properties are true:

— if $v < v'$, then $v + v'' < v' + v''$;

— if $v < v'$, then $\lambda \cdot v < \lambda \cdot v'$.

Since this ordering does not change under shift, it, in effect, defines an or- 
dering on the affine space. 

Definition 7. By a vector utility function, we mean a convexity-preserving util- 
ity function with values in an ordered affine space V . 

To analyze uniqueness of vector utility functions, we must consider isomorphisms.
A mapping $T$ between two affine spaces is called affine if it preserves
the affine structure, i.e., if $T(\sum \lambda_i \cdot v_i) = \sum \lambda_i \cdot T(v_i)$ whenever $\sum \lambda_i = 1$.
For finite-dimensional affine spaces, affine mappings are just linear transformations
$(x_1, \ldots, x_n) \to (y_1, \ldots, y_m)$, i.e., transformations in which each resulting
coordinate $y_i$ is determined by a linear function $y_i = a_i + \sum_j b_{ij} \cdot x_j$.

Definition 8. A one-to-one affine transformation $T : V \to V'$ of two ordered
affine spaces is called an isomorphism if for every $v, v' \in V$, $v < v'$ if and only
if $T(v) < T(v')$.

Recall that for every subset $S \subseteq V$ of an affine space $V$, its affine hull $A(S)$ can
be defined as the smallest affine subspace containing $S$, i.e., equivalently, as the
set of all affine combinations $\sum \lambda_i \cdot s_i$ ($\sum \lambda_i = 1$) of elements from $S$.

Theorem 1. Let $A$ be a set, and let $L$ be the set of all lotteries over $A$.

— (consistency) For every convexity-preserving function $u : L \to V$ from $L$ to
an ordered affine space, the relations $u(\ell) < u(\ell')$ and $u(\ell) = u(\ell')$ define a
preference relation.

— (existence) For every preference relation $(\prec, \sim)$, there exists a vector utility
function which describes this preference.

— (uniqueness) The utility function is determined uniquely modulo an isomorphism:

• If two different vector utility functions $u : L \to V$ and $u' : L \to V'$
describe the same preference relation, then there exists an isomorphism
$T : A(u(L)) \to A(u'(L))$ between the affine hulls of the images of the
functions, such that for every lottery $\ell$, $u'(\ell) = T(u(\ell))$.

• Vice versa, if a vector utility function $u : L \to V$ describes a preference
relation, and $T : A(u(L)) \to V'$ is an isomorphism of ordered affine
spaces, then the function $u'(\ell) = T(u(\ell))$ is also a vector utility function,
and it describes the same preference relation.

Example. In the above tank example, it is natural to describe each possible damage
outcome by a vector-valued utility $(u_1, u_2)$, where $u_1$ describes the tank’s
shooting abilities and $u_2$ the tank’s moving abilities. This is, of course, a simplified
example; we also need to take into consideration communication capabilities,
possibility of damage repair, etc., which leads to a higher-dimensional utility vector.
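
To make the two-component utility concrete, here is a minimal sketch that compares utility vectors. It assumes the componentwise order on pairs of numbers, which is only one possible ordered vector space, and the damage outcomes and their values are invented for illustration.

```python
def compare(u, v):
    """Compare utility vectors componentwise.

    Returns '<', '>' or '=' when the vectors are comparable and None otherwise.
    The componentwise order is an assumption made for this illustration.
    """
    if tuple(u) == tuple(v):
        return "="
    if all(a <= b for a, b in zip(u, v)):
        return "<"
    if all(a >= b for a, b in zip(u, v)):
        return ">"
    return None

def lottery_utility(lottery):
    """Convexity-preserving utility of a lottery given as [(probability, vector), ...]."""
    dim = len(lottery[0][1])
    return tuple(sum(p * vec[i] for p, vec in lottery) for i in range(dim))

# Hypothetical (shooting, moving) utilities of four damage outcomes.
intact = (1.0, 1.0)
gun_damaged = (0.2, 1.0)
track_damaged = (1.0, 0.3)
destroyed = (0.0, 0.0)

mixed = lottery_utility([(0.5, gun_damaged), (0.5, track_damaged)])
```

Here `compare(gun_damaged, track_damaged)` returns `None`: neither outcome is preferable to the other, which is exactly the situation a scalar utility cannot express.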
4 How to Describe Degrees of Belief (“Subjective
Probabilities”) for Partially Ordered Preferences?

In traditional (scalar) utility theory, it is possible to describe our degree of belief
$p_s(E)$ in each statement $E$, e.g., as follows: We pick two alternatives $a_0$ and $a_1$
with utilities 0 and 1, and as the degree of belief in $E$, we take the utility of
a conditional alternative “if $E$ then $a_1$ else $a_0$” (or $(E|a_1|a_0)$, for short). This
utility is also called subjective probability because if $E$ is a truly random event
which occurs with probability $p$, then this definition leads to $p_s(E) = p$. Indeed,
according to the convexity-preserving property of a utility function, we have

$p_s(E) = u(E|a_1|a_0) = p \cdot u(a_1) + (1 - p) \cdot u(a_0) = p \cdot 1 + (1 - p) \cdot 0 = p$. (1)

What can a similar description look like for partially ordered preferences? Before
we formulate our result, let us first explain the reasoning that led to this result.
The definition of subjective probability $p_s(E)$ for the linearly ordered case can be
rewritten as follows: for every two lotteries $\ell, \ell' \in L$, we have

$u(E|\ell|\ell') = p_s(E) \cdot u(\ell) + (1 - p_s(E)) \cdot u(\ell')$, (2)

or, equivalently,

$u(E|\ell|\ell') = p_s(E) \cdot (u(\ell) - u(\ell')) + u(\ell')$. (3)

In other words, we can interpret $p_s(E)$ as a linear operator which transforms the
utility difference $u(\ell) - u(\ell')$ into an expression

$u(E|\ell|\ell') - u(\ell') = p_s(E) \cdot (u(\ell) - u(\ell'))$. (4)

It is, therefore, reasonable to expect that for partially ordered preferences, when
we have multi-dimensional (vector) utilities with values in a vector space $V$,
$p_s(E)$ would also be a linear operator, but this time from $V$ to $V$ (and not from
$R$ to $R$). We will now show that this expectation is indeed true.

Definition 9. Let $A$ be a set, let $L$ be the set of all lotteries over $A$, and let $E$
be a formula (called event). By a conditional lottery, we mean an expression of
the type $\sum_i p_i \cdot \ell_i + \sum_k q_k \cdot (E|\ell'_k|\ell''_k)$, where $\sum_i p_i + \sum_k q_k = 1$, and $\ell_i$, $\ell'_k$, and $\ell''_k$
are lotteries. We will denote the set of all conditional lotteries by $L(E)$.

The meaning of a conditional lottery is straightforward: with probability $p_i$,
we run a lottery $\ell_i$, and with probability $q_k$, we run a conditional event “if $E$
then $\ell'_k$ else $\ell''_k$”.

Definition 10. Let $A$ be a set, and let $L(E)$ be the set of all conditional lotteries
over $A$. By a preference relation, we mean a pair $(\prec, \sim)$, where $\prec$ is a (strict)
order on $L(E)$, $\sim$ is an equivalence relation on $L(E)$, which satisfies conditions
1)-6) from Definition 2 plus the following additional conditions:

1. if $\ell \sim \ell'$, then $(E|\ell|\ell'') \sim (E|\ell'|\ell'')$;

2. if $\ell' \sim \ell''$, then $(E|\ell|\ell') \sim (E|\ell|\ell'')$;

4. $(E|p \cdot \ell + (1-p) \cdot \ell'|\ell'') \sim p \cdot (E|\ell|\ell'') + (1-p) \cdot (E|\ell'|\ell'')$;

5. $(E|\ell|p \cdot \ell' + (1-p) \cdot \ell'') \sim p \cdot (E|\ell|\ell') + (1-p) \cdot (E|\ell|\ell'')$;

6. $(E|p \cdot \ell + (1-p) \cdot \ell''|p \cdot \ell' + (1-p) \cdot \ell'') \sim p \cdot (E|\ell|\ell') + (1-p) \cdot \ell''$;

7. if $\ell \prec \ell'$, then $\ell \preceq (E|\ell|\ell') \preceq \ell'$.

The meaning of all these conditions is straightforward; e.g., 7) means that
$(E|\ell|\ell')$ is better (or of the same quality) than $\ell$ because in the conditional
alternative, both possibilities $\ell$ and $\ell'$ are at least as good as $\ell$.

In accordance with our Theorem 2, the utility of such events can be described 
by a vector utility function. 

Definition 11. Let $V$ be an ordered vector space.

— A linear operator $T : V \to V$ is called non-negative (denoted $T \ge 0$) if $x \ge 0$
implies $Tx \ge 0$.

— A linear operator $T$ is called a probability operator if both $T$ and $1 - T$ are
non-negative (where 1 is a unit transformation $v \to v$).

Theorem 2.

— Let $u : L \to V$ be a vector utility function and let $T : V \to V$ be a strict
probability operator. Then, a function $u^* : L(E) \to V$ defined as

$u^*\left(\sum_i p_i \cdot \ell_i + \sum_k q_k \cdot (E|\ell'_k|\ell''_k)\right) = \sum_i p_i \cdot u(\ell_i) + \sum_k q_k \cdot u^*(E|\ell'_k|\ell''_k)$, (5)

with $u^*(E|\ell|\ell') = T u(\ell) + (1 - T) u(\ell')$, is a vector utility function which
describes a preference relation on $L(E)$.

— Let $(\prec, \sim)$ be a preference relation on $L(E)$, and let $u : L(E) \to V$ be a
vector utility function which describes this preference. Then, there exists a
probability operator $T : A(u(L)) \to V$ for which

$u(E|\ell|\ell') = T u(\ell) + (1 - T) u(\ell')$ (6)

for all $\ell$ and $\ell'$.

Thus, we get a generalization of subjective probabilities, from scalar values 
p G [0, 1] (which, in our description, correspond to scalar matrices) to general 
linear probability operators. 
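
As an illustration of a probability operator, the following sketch assumes the componentwise order on $R^n$ (so the non-negative elements are the vectors with non-negative components) and represents operators as plain lists of rows. Under that assumption, $T \ge 0$ and $1 - T \ge 0$ amount to entrywise conditions on the matrix. The function names are hypothetical.

```python
def apply(T, v):
    """Multiply the matrix T (a list of rows) by the vector v."""
    return [sum(t * x for t, x in zip(row, v)) for row in T]

def is_probability_operator(T):
    """Check T >= 0 and 1 - T >= 0 for the componentwise order (an assumption).

    For this order an operator is non-negative exactly when all matrix
    entries are non-negative.
    """
    n = len(T)
    for i in range(n):
        for j in range(n):
            identity = 1.0 if i == j else 0.0
            if T[i][j] < 0 or identity - T[i][j] < 0:
                return False
    return True

def conditional_utility(T, u_l, u_lp):
    """u(E|l|l') = T u(l) + (1 - T) u(l')."""
    Tu = apply(T, u_l)
    Tup = apply(T, u_lp)
    return [a + b - c for a, b, c in zip(Tu, u_lp, Tup)]

T_diag = [[0.3, 0.0], [0.0, 0.8]]     # a different "probability" per component
T_mixed = [[0.5, 0.1], [0.0, 0.5]]    # fails: 1 - T has the entry -0.1
half = [[0.5, 0.0], [0.0, 0.5]]       # scalar case p = 0.5
value = conditional_utility(half, [1.0, 0.0], [0.0, 1.0])
```

With this particular order only diagonal matrices with entries in [0, 1] pass the check, so $n$ numbers describe the operator.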

In general, to describe a matrix, we need to describe its $n^2$ components,
where $n$ is the dimension of the space $V$. It turns out that to describe matrices
that represent probability operators, far fewer parameters are needed:

Theorem 3. Let $V$ be an $n$-dimensional ordered vector space. Then, the set of
all probability operators is at most $n$-dimensional.

Idea of the Proof. An ordering relation in an $n$-dimensional space is characterized
by a closed convex cone of all non-negative elements such that $V$ is a linear hull
of the cone. This cone is the convex hull of its extreme generators; thus, we can find
$n$ generators $e_i$ that form a base of the linear space $V$. To describe $T$, it is sufficient
to know $T(e_i)$ for all $i$. For each generator $e_i > 0$, the condition $0 \le T(e_i) \le e_i$
implies that $T(e_i)$ belongs to the same cone generator, i.e., that $T(e_i) = \lambda_i \cdot e_i$
for some real number $\lambda_i \ge 0$. So, to describe $T$, it is enough to know $n$ values
$\lambda_i$. Q.E.D.

Comment. Other results show that for most ordered vector spaces, we need even
fewer parameters.

Abstract. Two methods for Large Scale Air Pollution (LSAP) simulations
over static grids with local refinements are compared using the
object-oriented version of the Danish Eulerian Model. Both methods are
based on the Galerkin finite element method. The first one is over a
static locally refined grid; we call it the Static Local Refinement Algorithm
(SLRA). We compare SLRA with the Recursive Static-Regridding Algorithm
(RSRA), in which regular grids with finer resolution are nested
within a coarser mother grid. RSRA and SLRA are compared with the
translational and rotational cone tests. The drawbacks and advantages
of the methods are discussed.

Keywords: Air pollution modeling, rotational test, translational test, 
local refinement. 

Subject Classifications: 65N30, 76R05. 



1 Introduction 

The base grids of all known large-scale air pollution models are uniform and too
coarse to represent local phenomena well enough. For example, the mesh
size of the grid in the operational two-dimensional version of the Danish Eulerian
Model (DEM) is 50 km. It is difficult to resolve on this grid the enhancement (the
supply of more details) of concentrations and depositions in urban areas resulting
from the local emissions. In fact, the point source emissions are smeared out
over the corresponding grid cell (which is too big), and as a result an unnatural
diffusion is introduced into the model. The local phenomena can be represented
if a much finer grid is used. However, a uniform finer grid with a mesh size of
about 10 km leads to 25 times more grid squares and increases the
computational time of the algorithm correspondingly. The concept of local grid
refinement is a compromise between better resolution and computational expense
in terms of the computational time. In fact, higher resolution is mostly needed only inside
certain subdomains of the model domain, such as the areas around the point
sources and areas where strong gradients in the concentration field are observed.
It is much more difficult to organize local refinement in some specific (user-defined
or with strong gradients) areas than to refine the whole model domain
uniformly. The former, though, leads to a much “cheaper” solution than the latter,
while both solutions are comparable.
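
The factor of 25 follows directly from the mesh sizes. The short calculation below uses the 4800 km square domain mentioned later in the paper; the helper name `cells_per_side` is illustrative.

```python
def cells_per_side(domain_km, mesh_km):
    """Number of grid cells along one side of a square domain."""
    count, remainder = divmod(domain_km, mesh_km)
    if remainder:
        raise ValueError("mesh size must divide the domain size")
    return count

DOMAIN_KM = 4800
coarse_cells = cells_per_side(DOMAIN_KM, 50) ** 2   # 96 x 96 grid
fine_cells = cells_per_side(DOMAIN_KM, 10) ** 2     # 480 x 480 grid
ratio = fine_cells // coarse_cells                  # (50 / 10) ** 2
```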

In this paper two algorithms for Large Scale Air Pollution (LSAP) simula- 
tions over static grids with local refinements are compared and discussed. They 
are both implemented in the object-oriented version of the Danish Eulerian 
Model [1, 11]. 

Both algorithms are based on the Galerkin Finite Element Method (GFEM).
The first algorithm uses a static locally refined grid. This algorithm is called the Static
Local Refinement Algorithm (SLRA). We compare SLRA and the Recursive
Static-Regridding Algorithm (RSRA; see [6]).

RSRA is designed for time evolution problems involving partial differential 
equations that have solutions with sharp moving transitions, such as steep wave 
fronts and emerging or disappearing layers [6]. A version of it is used in the 
LSAP model called EURopean Operational Smog model (EUROS) [3,4]. 

Both SLRA and RSRA use grids concentrated just over the areas of interest.
The construction of the grids is based on the principle of local uniform grid
refinement, i.e., locally uniform refinement on some subdomains of a uniform
grid. The refinements can also be nested.

The remainder of the paper is organized as follows. Section 2 gives descriptions
of the algorithms. Section 3 focuses on the rotational test: numerical
results are presented and discussed. Section 4 describes two translational tests
and their results. Conclusions and future plans can be found in the last section,
Section 5.



2 The Two Algorithms 

2.1 Recursive Static-Regridding Algorithm (RSRA) 

When the solution advances from the time level $t_0$ to $t_0 + \Delta t$, the version of
RSRA we use for the tests in the article has the following steps (see [6] for the
original version of RSRA):

1. Integrate by GFEM on the coarse grid with time step $\Delta t$.

2. Integration is followed by regridding. For the applications of RSRA described
in this article, the regridding is always over the same subdomain. (This differs
from the description in [6], where the domain of the refinement is chosen
according to the sharp gradients of the coarse solution.)

3. Regridding is followed by interpolation. Since the fine grid is static, we interpolate
the concentrations at time level $t_0$ and specify boundary values for
the fine grid cells that abut on the coarse cells. We impose on these grid
interfaces Dirichlet boundary conditions via interpolation of the values of
the concentrations at the time level $t_0$. Since we use rectangular grids, the
interfacing cells form stripes along the edges of the finer grids. In the experiments
reported in this article the width of these stripes is two cells (three
points).

4. Finally, GFEM is applied on the finest grid over the time interval $t_0 \le t \le
t_0 + \Delta t$. The refined values of the concentrations found in this way are injected
into the coarse grid points.

Multiple levels in RSRA are handled in a recursive fashion. It is clear from step 
4 that when this algorithm is employed, the simulations over the refined domain 
influence the solution through the injection of the fine grid results. We can say 
that RSRA assumes that after the injection of the refined grid solution into the 
coarser grid solution, the result will be nearly the same as the one with global 
refinement. 
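
The four steps can be sketched in one space dimension. The code below is only an illustration of the control flow: it swaps GFEM for a first-order upwind scheme, uses linear interpolation, takes several fine substeps per coarse step, and injects only interior fine values. All of these choices, and the function names, are assumptions made for the sketch and are not taken from the paper.

```python
import numpy as np

def upwind_step(c, u, dx, dt):
    """One explicit first-order upwind step for c_t + u*c_x = 0 with u > 0.

    The inflow value c[0] is kept fixed (Dirichlet boundary condition).
    """
    new = c.copy()
    new[1:] = c[1:] - u * dt / dx * (c[1:] - c[:-1])
    return new

def rsra_step(coarse, x, lo, hi, u, dt, ratio=2):
    """Advance `coarse` by dt; the window x[lo..hi] is recomputed on a static
    grid that is `ratio` times finer, and the fine result is injected back."""
    dx = x[1] - x[0]
    # Step 1: integrate on the coarse grid.
    coarse_new = upwind_step(coarse, u, dx, dt)
    # Steps 2-3: static fine grid over the window; values at time t0 are
    # interpolated from the coarse grid, which also fixes the inflow value.
    xf = np.linspace(x[lo], x[hi], ratio * (hi - lo) + 1)
    fine = np.interp(xf, x, coarse)
    # Step 4: integrate on the fine grid over [t0, t0 + dt]; `ratio` substeps
    # keep the Courant number equal to the coarse one (an assumption).
    for _ in range(ratio):
        fine = upwind_step(fine, u, dx / ratio, dt / ratio)
    # Injection: fine values at coincident interior points replace coarse ones.
    coarse_new[lo + 1:hi + 1] = fine[ratio::ratio]
    return coarse_new

# Hypothetical setup: 96 cells of 50 km, wind 10 m/s = 0.01 km/s, Courant number 0.5.
x = np.linspace(0.0, 4800.0, 97)
cone = np.maximum(0.0, 100.0 * (1.0 - np.abs(x - 1200.0) / 600.0))
advanced = rsra_step(cone, x, lo=12, hi=48, u=0.01, dt=2500.0)
```

Multiple refinement levels would call the same routine recursively on the fine grid, which is the reuse property the text attributes to RSRA.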

The advantage of RSRA is that the same algorithm can be used for the coarse 
and the fine grids. This means that if we have an algorithm for simulations on the 
coarser regular grid, it can be reused in an algorithm for refined simulations. This 
is an algorithm which is easier to develop than one that seeks global solutions 
on the locally refined grid. The algorithms for regular two dimensional grids can 
use some splitting procedure to become simpler and faster. 



2.2 Static Local Refinement Algorithm (SLRA) 

The SLRA algorithm can be defined as follows.

First, the coarse grid over the whole computational domain is generated.
Then, a rectangular subdomain of the main domain that contains the region of
interest is extracted. Finally, the grid cells belonging to the selected subdomain are
refined to the desired resolution. The refined subdomain is plugged into the coarse
grid and then we generate the grid description suitable for our simulation code.
Clearly, this algorithm can be applied to several regions of refinement as long as
the refined regions do not overlap. (Let us mention that the RSRA algorithm has
the same requirement.) Then GFEM is applied with basis functions generated
on the obtained locally refined grid.
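
The grid construction described above can be sketched as follows. The function lists the square cells of a coarse grid in which a rectangular window of cells is split uniformly. The cell representation `(x, y, h)` and the function name are assumptions for illustration; a real grid description for a finite element code would also carry node numbering and interface information.

```python
def locally_refined_cells(n, size, window, ratio=2):
    """Return squares (x, y, h) covering [0, size]^2.

    The grid is n x n; coarse cells with indices i0 <= i < i1 and
    j0 <= j < j1, where window = (i0, i1, j0, j1), are split into
    ratio x ratio smaller cells.
    """
    h = size / n
    i0, i1, j0, j1 = window
    cells = []
    for i in range(n):
        for j in range(n):
            if i0 <= i < i1 and j0 <= j < j1:
                hf = h / ratio
                for a in range(ratio):
                    for b in range(ratio):
                        cells.append((i * h + a * hf, j * h + b * hf, hf))
            else:
                cells.append((i * h, j * h, h))
    return cells

# Hypothetical example: 16 x 16 coarse grid, a 4 x 4 window refined 2:1.
cells = locally_refined_cells(16, 4800.0, (4, 8, 6, 10))
total_area = sum(h * h for _, _, h in cells)
```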

3 Rotational Test 

The rotational test ([2,5]) is the most popular tool among the researchers in 
the fields of meteorology and environmental modeling for testing the numerical 
algorithms used in the large air pollution models. From the mathematical point 
of view, it is a pure advection problem which creates difficulties in the numerical 
treatment. We have changed the test to use the domain used in DEM: a square 
with size 4800 km. The test governing partial differential equation is: 

$\frac{\partial c}{\partial t} = -(2400 - y) \frac{\partial c}{\partial x} - (x - 2400) \frac{\partial c}{\partial y}$, (1)

where $0 \le x \le 4800$ km, $0 \le y \le 4800$ km. The initial cone has its center
at the point (0.25, 0.5) * 4800 km and radius 1/8 * 4800 km. Its height is 100
m. All the test experiments presented below are run inside the object-oriented
version of DEM (OODEM) [1]. The Crank-Nicolson method and restarted
GMRES(30) are used to solve the discretized problem. In fact, the codes are taken
from the Portable, Extensible Toolkit for Scientific Computation (PETSc), which
is a suite of uni- and parallel-processor codes for solving large-scale problems
modeled by partial differential equations (see [9, 10]). The pre- and post-processing
procedures, including visualization tools, are made using Mathematica ([8]).
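
The setup of the rotational test can be written down in a few lines. The sketch below assumes a linear cone profile and reads the wind components off the advection equation, $u = 2400 - y$ and $v = x - 2400$; the constant and function names are illustrative.

```python
import math

SIZE = 4800.0                          # side of the square domain, km
CENTER = (0.25 * SIZE, 0.5 * SIZE)     # initial cone center
RADIUS = SIZE / 8.0                    # cone radius, km
HEIGHT = 100.0                         # cone height

def initial_cone(x, y):
    """Cone of the given height and radius (the linear profile is an assumption)."""
    r = math.hypot(x - CENTER[0], y - CENTER[1])
    return max(0.0, HEIGHT * (1.0 - r / RADIUS))

def wind(x, y):
    """Wind (u, v) for which c_t + u*c_x + v*c_y = 0 matches the test equation."""
    return (SIZE / 2.0 - y, x - SIZE / 2.0)

def radial_component(x, y):
    """Dot product of the wind with the vector from the domain center to (x, y)."""
    u, v = wind(x, y)
    return u * (x - SIZE / 2.0) + v * (y - SIZE / 2.0)
```

The wind is perpendicular to the radius vector everywhere, so the exact solution is a rigid rotation of the cone around the domain center, and any loss of height after a full rotation is a numerical effect.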





Fig. 1. 16 X 16 coarse grid models of the refined grids used for the rotational tests. The 
grids used in the tests have 96 x 96 coarse, mother grids. On the left - RSRA-2:1; on 
the right - SLRA-2:1. 



Six numerical experiments are done in the framework of the rotational test.
For each of the grids RSRA-2:1 and SLRA-2:1 (see Fig. 1), one complete rotation
of the cone was performed with 400, 800, and 1600 steps.

It is important that the numerical methods which are applied to the cone 
test keep the height of the cone close to its initial one. The cone heights after 
the final steps of the six experiments are shown in Table 1. 



Table 1. Maximal values of the top of the cone for the rotation tests.

steps       400       800       1600
RSRA-2:1    92.2317   92.3666   92.2938
SLRA-2:1    92.5299   92.0869   92.2268



We were not able to observe differences between the RSRA-2:1 results and
those of SLRA-2:1 using 3D plots. Nevertheless, on very detailed contour plots
of the results obtained with the 1600-step experiments, the RSRA-2:1 results look
slightly better (see Figure 2).





Fig. 2. Contour plots when the cone enters the refined region. On the left - RSRA-2:1; 
on the right - SLRA-2:1. The contours are at the concentration values 5, 7, 9, 11, 95. 



Another interesting phenomenon is the increase of the number of iterations
in the RSRA-2:1 rotational experiment over the refined region when the cone is
not on it (see Figure 3). The number of iterations becomes more uniform when
the number of rotational steps is increased: for the 1600-step rotation it varies
between 3 and 5. Obviously this happens because of the convergence tester. With real data
simulations this will not happen.




Fig. 3. Number of iterations for RSRA-2:1 rotational test with 400 steps. Fine grid - 
left, coarse grid - right. 



4 Translational Test 

Another way to compare the methods is using a translational test, i.e., a test in
which the cone is translated in and out of the finer grid. As in the rotational
test, we study how the two algorithms described above keep the height and shape
of the cone close to the initial ones. We did two experiments corresponding to the
two grids, RSRA-2:1 and SLRA-2:1, each with constant wind magnitude 10
m/s, which is close to the typical maximum wind magnitudes in DEM. The wind
has zero y-component, and it changes its x-direction every 240 steps. As in the
rotational cone test, the initial position of the cone is on the finer grid at the point
(0.25, 0.5) * 4800 km. The time step is 1000 s, which is close to the time step of 900
s used in DEM. The number of steps between reversals is 240 so that the wind
direction changes at the points (0.25, 0.5) * 4800 km and (0.75, 0.5) * 4800 km. The
results are shown in Table 2. We can see that the RSRA-2:1 cone height decreases
much faster than the SLRA-2:1 cone height, which stays close to the initial one.
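
The choice of 240 steps can be verified with a short calculation using only the numbers quoted in the paragraph above; the variable names are illustrative.

```python
WIND_M_PER_S = 10.0
TIME_STEP_S = 1000.0
STEPS_PER_LEG = 240
DOMAIN_KM = 4800.0

# Distance travelled by the cone between two changes of the wind direction.
leg_km = WIND_M_PER_S * TIME_STEP_S * STEPS_PER_LEG / 1000.0

start_x_km = 0.25 * DOMAIN_KM
turn_x_km = start_x_km + leg_km
```

One leg covers 2400 km, which is exactly half of the domain, so the cone center moves between x = 1200 km and x = 3600 km.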

We can explain these results in the following way. Let us assume we have
RSRA grids with several levels of refinement. If a puff is located inside the most
refined region, some of the concentrations it consists of will lie only on the refined
grid but not on the coarser one. After the injection in step 4 of RSRA is done,
the puff generally will be aliased with a smoother concentration distribution on
the coarser grid. If the puff is blown away from the region of refinement, the
detailed information will be lost: it will only be sampled on the coarser grids. If
SLRA is used, the blown away puff will be “incorporated” in the SLRA solution.

To confirm the reasoning above we did further translation experiments. RSRA
was used with three square, telescopically nested grids. Each of the finer grids
is twice as fine as the one it is nested within. The region of refinement was
placed in the middle of the base grid. The width of the transition stripes for
the finer grids is 3. The side size of the second grid is 1/4 of the base one;
the third, finest grid covers the inner non-transitional area of the second. We
denote RSRA with these grids by RSRA-1:2:4. The corresponding SLRA grid
is denoted by SLRA-1:2:4.

Table 2. Maximum cone values for the translation tests.

         RSRA-2:1                    SLRA-2:1
step     finer grid   coarser grid   finer grid   coarser grid
0        100          -              100          -
240      -            93.4417        -            95.6048
480      96.8807      -              99.6673      -
720      -            93.5301        -            95.6
960      96.0226      -              99.5504      -
1200     -            93.592         -            95.5989
1440     94.555       -              99.487       -

Initially the cone lies completely in the region with densest refinement. The
cone is translated completely out of the refined region and then it is translated
back to its initial position. For each method two experiments were made: one
with wind being parallel to the x-axis with constant magnitude equal to 10 m/s,
and one with the wind being parallel to the diagonal of the grid with constant
magnitude equal to $\sqrt{200}$ m/s. The simulation time step is 1000 s. Table 3
shows the maximal values at the initial, intermediate, and final stages of the
experiments. At the intermediate stage the wind direction is reversed. We can
see that SLRA-1:2:4 preserves the cone height very close to the initial one, and
that RSRA-1:2:4 damps it by 6-7 %.

Table 3. Cone translation with horizontal and diagonal winds. The RSRA-1:2:4 values
at the initial and final stages are taken from the finest grid; the intermediate value is
taken from the base grid.

                             Maximal values
Method       Wind            Initial   Intermediate   Final
RSRA-1:2:4   x-wind only     100       91.5307        93.6504
RSRA-1:2:4   diagonal        100       90.0830        94.4363
SLRA-1:2:4   x-wind only     100       95.4348        99.9965
SLRA-1:2:4   diagonal        100       94.5626        99.9537



5 Conclusions and Future Plans 

Two algorithms for Large Scale Air Pollution (LSAP) simulations over static
grids with local refinements, RSRA and SLRA, are compared and discussed using
the object-oriented version of the Danish Eulerian Model. It is not possible
to judge which method is better with the rotational and translational tests we
have used. RSRA is easier to implement, but it has the puff smoothing effect
described in Section 4. On the other hand, in SLRA certain wave components
of the puff will be reflected by the inner grid boundaries when the puff passes
through them (see [7]). To estimate the significance of these RSRA and SLRA
effects, we can extend the described experiments with some that include
chemical reactions. Another possible extension is to make simulations with real
data and to compare the results.

It is easy with RSRA to have dynamic local refinements. To make a corre- 
sponding implementation of SLRA would mean dynamic invocation of the grid 
generator - so far in OODEM the grid generation is a preprocessing step. With 
RSRA this preprocessing is skipped. 

Last, an RSRA implementation with real data can be used to investigate
if it is possible to obtain more accurate solutions over the refined region using
Richardson extrapolation for the results of the different grids.

Abstract. In many transport-chemistry models, a huge system of
ODEs of the advection-diffusion-reaction type has to be integrated in
time. Typically, this is done with the help of operator splitting. Rosenbrock
schemes combined with approximate matrix factorization (ROS-AMF)
are an alternative to operator splitting which does not suffer from
splitting errors. However, implementation of ROS-AMF schemes often
requires serious changes in the code.

In this paper we test another classical second order splitting introduced
by Strang in 1963, which, unlike the popular Strang splitting, seemed
to be forgotten and was rediscovered only recently (partially due to its intrinsic
parallelism). This splitting, called symmetrically weighted sequential
(SWS) splitting, is simple and straightforward to apply, independent of
the order of the operators and has an operator-level parallelism. In the
experiments, the SWS scheme compares favorably to the Strang splitting,
but is less accurate than ROS-AMF.



1 Introduction 

Transport-chemistry models, describing the concentration changes of different
chemical species (so-called tracers) in the atmosphere, are based on a PDE system
of the form [10, 18]:

$\frac{\partial c_i}{\partial t} = T c_i + f_i$, (1)

where $c_i$ denotes the concentration of the $i$th tracer. The linear differential operator
$T$ describes the various transport processes, such as advection and diffusion,
and in global models also cumulus convection. The non-linear term $f_i$ represents
chemical reactions often including emission and deposition processes.

Since, after spatial discretization, the number of grid-points in a modern air
pollution model can range from a few thousand to a few hundred thousand, and
the number of chemical species is typically between 20 and 100, the numerical
integration of this system on long time intervals is a huge computational task.
The requirements for accuracy and efficiency can hardly be satisfied if the terms
on the right-hand side are treated together. Moreover, these terms have different
mathematical properties. For example, the chemistry and the vertical transport
operators introduce stiffness to the system and thus require the application of
a special implicit method. However, applying an implicit method to the whole
problem would be too expensive. This difficulty is usually avoided by using some
kind of operator splitting. The methods used in this field include, among others,
sequential and Strang splitting [8], the source splitting methods and Rosenbrock
schemes with approximate matrix factorization (ROS-AMF) (for a survey, see [10,
18]).

In 1963 Strang proposed a splitting method where a weighted sum of splitting
solutions, obtained by different orderings of the sub-operators, is computed
at each time step [7]. Analysis of this method can be found e.g. in [2]. The
symmetrically weighted sequential (SWS) splitting is second order accurate, and
higher than second order under some circumstances. These properties suggest that the
weighted splitting schemes may be a good alternative to the traditional splitting
methods.

The main aim of this paper is to test the performance of these rediscov- 
ered splitting schemes in a simplified one-column version of a global transport- 
chemistry model. We address the following questions: 

— How does the symmetrically weighted splitting compare to the also second 
order Strang splitting? 

— How do these splitting methods compare to the ROS3-AMF scheme (third 
order Rosenbrock method with approximate matrix factorization), which 
proved to be a viable alternative to splitting methods in air pollution mod- 
eling [1,10]? 

The paper has the following structure. Section 2 describes the ROS3-AMF+
method and the splitting schemes to be compared. In Section 3 a brief description
of our test model is given and the results of the numerical comparisons
are discussed. The SWS splitting has nice parallelization properties, which are
discussed in Section 4. Finally, conclusions are drawn in Section 5.

2 Integration Methods 
2.1 ROS3-AMF+ 

ROS-AMF schemes are not splitting schemes, since the decomposition of the 
processes appears only on the linear algebra level (in AMF). The Rosenbrock 
time integration methods are a generalization of the well-known Runge-Kutta 
methods [3]. For the semi-discrete autonomous ODE system 

$\dot{u} = F(u)$ (2)

the third order Rosenbrock method [4] reads as

$u^{n+1} = u^n + \tfrac{5}{4} k_1 + \tfrac{3}{4} k_2$,
$(I - \gamma \Delta t J) k_1 = \Delta t F(u^n)$, (3)
$(I - \gamma \Delta t J) k_2 = \Delta t F(u^n + \tfrac{2}{3} k_1) - \tfrac{4}{3} k_1$,

where $J$ denotes the Jacobian matrix $F'(u^n)$ and $\gamma = \tfrac{1}{2} + \tfrac{\sqrt{3}}{6}$. This specific $\gamma$
yields A-stability [4]. In our case the vector $u$, approximating the concentration
function, has $m n_z$ entries, where $n_z$ is the number of vertical layers. Further,
$F(u) = Vu + r(u) + E$, where $V$ is the vertical mixing matrix, $r$ the semi-discrete
chemical operator, $E$ is the emission, and $J = V + R$ with $R = r'(u^n)$.
There exist modifications of the above scheme in which, to reduce the costs, $J$
is replaced by an approximate matrix. When standard AMF is used,

$(I - \gamma \Delta t J) \approx (I - \gamma \Delta t R)(I - \gamma \Delta t V)$. (4)

The error of the above approximation is $(\gamma \Delta t)^2 RV$, which may be large. Therefore,
an improved version of this scheme was developed, which is called ROS3-AMF+.
Here the approximation

$(I - \gamma \Delta t J) \approx (L_V - \gamma \Delta t R) U_V$ (5)

is used, with the LU factors of $I - \gamma \Delta t V = L_V U_V$, $\mathrm{diag}\, U_V = I$. This approximation
still has an error of $O(\Delta t^2)$, but it often can be shown to be bounded
by $\gamma \Delta t \|R\|$. Numerical experiments reveal an improved accuracy of AMF+ [1].
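
A single step of the Rosenbrock scheme can be sketched as follows. The sketch uses the exact Jacobian and a direct linear solve, so it does not include approximate matrix factorization. The coefficients 5/4, 3/4, 2/3, 4/3 and $\gamma = 1/2 + \sqrt{3}/6$ are those of the two-stage third-order Rosenbrock method as it is commonly stated in the literature; the function names and the linear test problem are illustrative assumptions.

```python
import numpy as np

GAMMA = 0.5 + np.sqrt(3.0) / 6.0

def ros3_step(F, jac, u, dt):
    """One step of the two-stage Rosenbrock scheme with the exact Jacobian."""
    M = np.eye(u.size) - GAMMA * dt * jac(u)
    k1 = np.linalg.solve(M, dt * F(u))
    k2 = np.linalg.solve(M, dt * F(u + 2.0 / 3.0 * k1) - 4.0 / 3.0 * k1)
    return u + 1.25 * k1 + 0.75 * k2

def integrate(F, jac, u0, t_end, n_steps):
    """Apply n_steps equal steps of ros3_step on [0, t_end]."""
    u = np.asarray(u0, dtype=float)
    dt = t_end / n_steps
    for _ in range(n_steps):
        u = ros3_step(F, jac, u, dt)
    return u

# Hypothetical test problem u' = -u, u(0) = 1, with exact solution exp(-t).
A = np.array([[-1.0]])
F = lambda u: A @ u
jac = lambda u: A
exact = np.exp(-1.0)
err_coarse = abs(integrate(F, jac, [1.0], 1.0, 10)[0] - exact)
err_fine = abs(integrate(F, jac, [1.0], 1.0, 20)[0] - exact)
observed_ratio = err_coarse / err_fine   # tends to 8 for a third-order method
```

In an AMF variant the matrix `M` would be replaced by the product in (4) or (5), which changes only the two linear solves.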

2.2 Sequential and Strang Splitting

Let $\Phi_V(t_n, \Delta t)$ and $\Phi_R(t_n, \Delta t)$ denote the numerical solution operators applied
to the sub-systems

$\dot{y}_1 = V y_1$ and $\dot{y}_2 = r(y_2) + E$, (6)

describing vertical mixing and chemistry with emission, respectively, on the interval
$[t_n, t_{n+1}]$. The solution of the sequential splitting at $t_{n+1}$ can be
expressed as

$y^{n+1} = \Phi_R(t_n, \Delta t) \Phi_V(t_n, \Delta t) y^n$, (7)

where the ordering of $\Phi_R$ and $\Phi_V$ is taken according to [6].

We use Strang splitting [8] in the form

$y^{n+1} = \Phi_V(t_{n+1/2}, \tfrac{1}{2}\Delta t) \Phi_R(t_{n+1/2}, \tfrac{1}{2}\Delta t) \Phi_R(t_n, \tfrac{1}{2}\Delta t) \Phi_V(t_n, \tfrac{1}{2}\Delta t) y^n$. (8)

2.3 Weighted Sequential Splitting 

Another splitting scheme can be obtained by applying sequential splitting in
both orders of the sub-operators and by taking a weighted average of the results
in each time step according to the following formula:

$$y^{n+1} = \theta\, \Phi_V(t_n, \Delta t)\, \Phi_R(t_n, \Delta t)\, y^n + (1 - \theta)\, \Phi_R(t_n, \Delta t)\, \Phi_V(t_n, \Delta t)\, y^n, \qquad (9)$$

where $\theta \in (0, 1)$ is a weight parameter. This method has second order for the
choice $\theta = 0.5$, otherwise first order. If $\theta = 0.5$, the method is called symmetrically
weighted sequential (SWS) splitting, first proposed in [7]. The properties
of this scheme on the continuous level were analyzed in [2].
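
The orders of the splitting schemes (7), (8) and (9) can be illustrated on a toy problem. The sketch below is not the one-column model: it assumes a linear system $y' = (V + R)y$ with two small non-commuting matrices chosen arbitrarily, and it uses exact sub-flows (matrix exponentials computed by a truncated Taylor series) instead of ROS3 sub-solvers. The helper names `expm_taylor`, `integrate` and `error` are hypothetical. Halving the step size should roughly halve the error of a first-order scheme and quarter that of a second-order one.

```python
import numpy as np

def expm_taylor(M, terms=40):
    """Matrix exponential via a truncated Taylor series (fine for small ||M||)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        result = result + term
    return result

def integrate(scheme, V, R, y0, t_end, n_steps, theta=0.5):
    """Advance y' = (V + R) y with exact sub-flows combined by `scheme`."""
    dt = t_end / n_steps
    EV, ER = expm_taylor(dt * V), expm_taylor(dt * R)
    EVh, ERh = expm_taylor(0.5 * dt * V), expm_taylor(0.5 * dt * R)
    if scheme == "sequential":      # V first, then R, as in (7)
        step = ER @ EV
    elif scheme == "strang":        # V-R-R-V built from four half steps, as in (8)
        step = EVh @ ERh @ ERh @ EVh
    elif scheme == "weighted":      # weighted average of both orderings, as in (9)
        step = theta * (EV @ ER) + (1.0 - theta) * (ER @ EV)
    else:
        raise ValueError(scheme)
    y = y0.copy()
    for _ in range(n_steps):
        y = step @ y
    return y

# Toy non-commuting operators (illustrative, unrelated to the real model).
V = np.array([[-2.0, 1.0], [1.0, -2.0]])
R = np.array([[-1.0, 0.0], [3.0, -0.5]])
y0 = np.array([1.0, 0.5])
exact = expm_taylor(V + R) @ y0     # reference solution at t = 1

def error(scheme, n_steps):
    return np.linalg.norm(integrate(scheme, V, R, y0, 1.0, n_steps) - exact)

for name in ("sequential", "strang", "weighted"):
    e_coarse, e_fine = error(name, 50), error(name, 100)
    print(name, e_coarse, e_fine, e_coarse / e_fine)
```

Calling `integrate("weighted", ..., theta=0.3)` instead shows the drop to first order for a non-symmetric weight.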
3 Numerical Comparisons

3.1 The Test Problem 

For testing the performance of the methods discussed in Section 2, we chose 
a simple one-column model. The chemical scheme of this model is CBM-IV 
(Carbon Bond Mechanism IV), involving chemical reactions of 32 species. Emis- 
sions are set according to the urban scenario [5] (high emissions). The vertical 
mixing involves vertical diffusion and convection according to the TM3 global 
chemistry-transport model [9]. The number of vertical layers is 19. 

In our experiments the model is run for a period of five days starting with 
an initial concentration vector, taken as in [9]. The reference solution in our 
experiments is obtained by using a very small time-step size. The sub-problems 
in the splitting schemes were solved by the ROS3 method. 

In our comparisons we used the time step $\Delta t = 15$ min for all the methods. The
computational costs were the same for all the methods compared.

We remark that in the Strang splitting the solution depends considerably on
the order of the operators, i.e., in the splitting (8) we could change the order of
operators from V-R-R-V to R-V-V-R. Indications in the literature concerning which
order should be taken are ambiguous: Sportisse [6] advocates ending the process
with the stiff operator, while Verwer et al. [10] suggest the opposite order for the
Strang splitting. Therefore, both Strang splittings, Strang V-R-R-V and Strang
R-V-V-R, were included in the experiments.

Generally, all the methods (ROS3-AMF+, SWS splitting, Strang V-R-R-V
and Strang R-V-V-R) give good results, with relative errors below 10% in most
cases. The most accurate method is unquestionably ROS3-AMF+ for all of the
tracers. The fact that the method which is not based on splitting turned out to be
the best one suggests that the splitting error plays a crucial role in the global error.
Among the other three methods, which are all based on splitting, it is difficult
to find a clear winner. The Strang V-R-R-V method could be preferred to the
SWS splitting and the other Strang method. The quality of the SWS solutions
can be placed between those of the two Strang solutions. A typical case is shown
in Figure 1 for layer 1.

More precisely, Strang V-R-R-V was better than Strang R-V-V-R for 20
tracers and better than SWS for 18 tracers. SWS was better than Strang R-V-V-R for
21 tracers. It is also interesting to examine the number of cases where the
errors were significant:

- Comparing Strang V-R-R-V with SWS splitting, we see 10 tracers for which
  one of the schemes gave large errors (of which SWS is more accurate for
  7 tracers).
- Comparing Strang R-V-V-R with SWS splitting, we see 11 tracers for which
  one of the schemes gave large errors (of which SWS is more accurate for
  8 tracers).

We can state that for the most problematic stiff species the SWS splitting performs
remarkably well. For three tracers, OH, HO2 and NO3, the SWS splitting
gave much better results than either of the Strang splittings.
Fig. 1. Solutions of ROS3-AMF+, Strang V-R-R-V, Strang R-V-V-R and SWS splitting
for trace gas isoprene on layer 1.



In the experiments made with Strang R-V-V-R we found two cases where
the results were unacceptable: for N2O5 and NO3, the correct trend of
the concentration changes was not reflected, and there was no sign of the high peaks
shown by the reference solution. Meanwhile, the SWS splitting was able to describe
these peaks, see Figure 2. We can conclude that SWS splitting is not only
generally better than Strang R-V-V-R but, being free from some of the large errors
produced by that method, is also more reliable. This feature should be appreciated
all the more because, as we already mentioned, in many cases it is not possible
to decide which Strang method would give better results.

Returning to the question of a proper ordering of the sub-operators in the
Strang splitting, we note that in our case the choice proposed in [10], namely
V-R-R-V, was better than the other one, advocated in [6].

Fig. 2. Solutions of SWS splitting and Strang R-V-V-R for trace gas NO3 on layer 1.

4 Parallelization of the SWS Splitting

If several processors are used, the SWS splitting can also be advantageous from
the viewpoint of CPU time. All the methods considered in this paper can
be parallelized across the space, using domain decomposition. However, since
the processes V-R and R-V can be computed independently, the SWS scheme also has
a so-called parallelism across the scheme, which, in combination with the
parallelism across the domain, leads to an attractive parallel algorithm. This
across-the-scheme parallelization has a scalability factor of two, i.e.,

$$\frac{T_{\mathrm{sws}}(p)}{T_{\mathrm{sws}}(2p)} = 2, \qquad (10)$$

where $T_{\mathrm{sws}}(p)$ denotes the CPU time for the SWS splitting parallelized across
the space on $p$ processors, and $T_{\mathrm{sws}}(2p)$ is the CPU time for SWS splitting
parallelized across the space and across the method, on $2p$ processors. The
across-the-space parallelization for both Strang splitting and SWS splitting can be
characterized by the speedup function



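$$S(B, p) = \frac{p}{Bp + 1 - B}, \qquad (11)$$

where $B \in (0, 1)$ is the non-parallelizable fraction of the work in the algorithm.
The parallel part requires $(1 - B)\,T(1)/p$ time. Thus,

$$T_{\mathrm{str}}(2p) = \frac{T_{\mathrm{str}}(1)}{S(B, 2p)}, \qquad T_{\mathrm{sws}}(p) = \frac{T_{\mathrm{sws}}(1)}{S(B, p)}, \qquad (12)$$

where $T_{\mathrm{str}}(2p)$ is the CPU time of the Strang splitting on $2p$ processors. By use
of (10),

$$T_{\mathrm{sws}}(2p) = \frac{T_{\mathrm{sws}}(1)}{2\,S(B, p)}. \qquad (13)$$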



We know that if the traditional Strang splitting is used (namely, one step in the
middle in (8)), then

$$\frac{T_{\mathrm{sws}}(1)}{T_{\mathrm{str}}(1)} = \frac{4}{3}. \qquad (14)$$

It is easy to see that

$$T_{\mathrm{sws}}(2p) < T_{\mathrm{str}}(2p) \qquad (15)$$

whenever

$$\frac{4}{3} < \frac{2\,S(B, p)}{S(B, 2p)} = \frac{1 - B + 2Bp}{1 - B + Bp}, \qquad (16)$$

where the right-hand side increases monotonically to 2 as $p \to +\infty$. Consequently, if

$$p \ge p_{\mathrm{crit}} = \left\lfloor \frac{1 - B}{2B} \right\rfloor + 1, \qquad (17)$$



then for $2p$ processors SWS splitting is more efficient than Strang splitting. In
Figure 3 we plot the predicted CPU times for Strang and SWS splitting versus the
number of processors for the case $B = 0.15$. We see that SWS splitting is faster
than Strang already on 6 processors.

Note that the ROS3-AMF+ scheme has the same parallelism as Strang splitting.
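
The predicted timings can be reproduced from the speedup model. The sketch below assumes the normalization $T_{\mathrm{str}}(1) = 1$ and the cost ratio 4/3 from (14); the function names are hypothetical. It evaluates (11), (12), (13) and (17) for $B = 0.15$.

```python
import math

def speedup(B, p):
    """Amdahl-type speedup S(B, p) = p / (B*p + 1 - B), cf. (11)."""
    return p / (B * p + 1.0 - B)

def t_strang(B, procs, t1=1.0):
    """Predicted Strang time on `procs` processors (parallel across space only)."""
    return t1 / speedup(B, procs)

def t_sws(B, procs, t1=4.0 / 3.0):
    """Predicted SWS time on an even number of processors: the two operator
    orderings run side by side (factor 2), each on procs/2 processors."""
    if procs % 2 != 0:
        raise ValueError("the across-the-scheme model needs an even processor count")
    return t1 / (2.0 * speedup(B, procs // 2))

def p_crit(B):
    """Smallest integer p with p > (1 - B) / (2 B), cf. (17)."""
    return math.floor((1.0 - B) / (2.0 * B)) + 1

B = 0.15
print("p_crit =", p_crit(B))
for procs in (2, 4, 6, 8):
    print(procs, round(t_strang(B, procs), 4), round(t_sws(B, procs), 4))
```

For $B = 0.15$ the threshold is $p = 3$, that is 6 processors in total, in agreement with the statement above.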




Fig. 3. Comparison of CPU times for SWS splitting and Strang splitting as a function
of the number of processors if $T_{\mathrm{str}}(1) = 1$.



5 Conclusions 

We compared the solutions of ROS3-AMF+, SWS splitting and the Strang V-R-R-V
and R-V-V-R splittings in a one-column transport model with stiff vertical
mixing and chemistry. Our main conclusions are as follows.

- All the methods (which are equal in computational costs) give good results,
  with relative errors mostly below 10%.
- ROS3-AMF+ gives the best results. Strang V-R-R-V splitting generally performs
  better than SWS splitting, while Strang R-V-V-R splitting is the least
  accurate, with unacceptably large errors for two tracers.
- SWS splitting gives acceptable solutions for all species. Also, for most of the
  problematic stiff species it performs better than either of the Strang splittings.
  Therefore, since it is generally not known which Strang splitting should be
  used, the SWS splitting can be a fairly reliable alternative.
- As opposed to Strang splitting, SWS splitting can be parallelized on the
  operator level, which, in combination with across-the-space parallelism, leads
  to an attractive parallel algorithm.
Abstract. On 18-19 September 2001, exceptional weather conditions
brought very hot air and long-range transported particles to Scandinavia.
This paper discusses a dust episode detected in Southern Finland, Estonia
and Sweden. Short- and long-range transport of dust to the
target areas is simulated with the grid model Hilatar. It was found that
the dust originated partly from Estonian and Russian oil-shale-burning
power plants and partly from desert sand lifted
by wind in Kazakhstan. The frequency of such dust transport is also discussed.

1 Introduction 

Climate variation and human activities have led to increased erosion. Drought
and desertification threaten the livelihood of over 1 billion people in more than
110 countries around the world [1]. Deserts are expanding all over the world,
especially in Asia and Africa. For instance, the Aral Sea in central Asia, between
Kazakhstan and Uzbekistan, has shrunk to less than half its size in just 30 years,
becoming salty and leaving in its wake 3.6 million hectares of polluted soil that can
be swept up by fierce storms. The humid areas in the river delta have dried out
by 85%. The number of bird species has decreased from 173 to 38, and of 24 fish
species only 4 remain (Worldwatch Institute Report 1996). In China, the desert,
expanding at an annual rate of 2 to 3 km, has encroached to within 70 kilometres
of Beijing. By the late 1990s, China's deserts were expanding by 2,460 sq km
per year, including 2,000 sq km in Inner Mongolia, where the Gobi Desert lies.

In addition to local consequences for the environment, food production, social
life, infrastructure and human health, long-range transport of dust threatens
ecosystems and humans far away from the deserts. Saharan dust clouds can
travel thousands of kilometres over the Atlantic. Dust clouds can fertilise the waters
with iron, increasing blooms of toxic algae [2]. Dust particles act as condensation
nuclei in rain clouds, decreasing the average droplet size and thus choking rain
clouds. When rain zones are shifted away from dust emission areas, drought and
erosion can accelerate [3].

2 Nordic Dust Episode, September 17-24, 2001

Dust episodes have gained increased attention in Northern Europe due to their
health effects. In Finland, increased concentrations have been thought to be
caused by local sources and, especially in spring, by re-emitted dust from sand
spread on the streets to reduce the slipperiness of the frozen ground. Dust episodes
have, however, also been detected in summer. During September 17-24, 2001, high
concentrations of PM10 (with simultaneous maximum concentrations of 190 µg m$^{-3}$
for PM10 and over 40 µg m$^{-3}$ for PM2.5) were observed at more than 10
monitoring stations along the Estonian and Finnish coasts of the Gulf of Finland
and in the central part of Sweden. The episode lasted one week.

The meteorological situation at this time was exceptional: a rather strong
Siberian high remained in almost the same position for more than one week. Along
its southern flank, exceptionally warm air was transported from the semiarid
central Asian areas north of the Aral Sea and the Caspian Sea towards Fennoscandia.
North of the Black Sea this flow met another warm stream originating from the
Mediterranean and North African areas. The warm air mass was exceptionally
deep: in Finland the temperature T(z) at the FMI Jokioinen sounding station
was around 9 °C warmer at altitudes of 0.75-1.4 km than the 1961-1980 September
average [4]. The atmosphere over the Gulf of Finland, and at night
also over its surroundings, was strongly stably stratified.



3 Local-Scale Simulations 



The origin of the elevated dust was estimated with the dynamical grid model
Hilatar [5,6], with a grid spacing of 0.2-0.4° (around 22-44 km), by backward and
forward simulations [7]. In the backward simulations the model concentrations were
fitted to the measured ones at the stations, and the simulations were performed by
picking the meteorological input fields in reverse time order. The results show that
the concentrations detected at the western stations in Finland could have been
transported from the Estonian or Slantzy oil-shale-burning power plants by surface
winds, while the eastern concentrations (Kotka) could have been transported
to the northern coast by elevated winds over the stable Gulf, being then mixed
down by the coastal turbulence. The forward simulations confirmed this origin of
the dust. The simulated concentrations rose on both the northern and southern
coasts of the Gulf on the afternoon of September 18, while the flow was channelled
over the Gulf of Finland. The high-level Estonian plumes met the flow originating
from St. Petersburg over the Finnish coastline, while the low-level emissions
remained above Estonia.

The elemental composition of individual particles was studied with a scanning
electron microscope (ZEISS DSM 962) coupled with an EDX (LINK ISIS
with ZAF-4 measurement program). In addition, the chemical composition of samples
collected with a virtual impactor was analyzed by IC and ICP/MS [7].
The experimentalists concluded that, although there were very small particles
on the surface of the spherical combustion particles, the shape and chemical
composition were so close to those of previous samples collected near the Estonian
oil-shale emission sources that the origin of the episode must be very close to the
Finnish borders.
However, there remained the question of the missing sulphur: the oil-shale-burning
power plants produce significant amounts of SO2 and SO4 particles, but the
concentrations of these were not as high. Because the measured dust concentrations
were also higher than those simulated, other dust source areas along the backward
trajectory from the measurement sites could have contributed to the
episode. The relative humidity was high, so that fine matter, while transported
over Estonia, could have condensed on the surface of the Estonian
particles. For this reason, the simulation area of the dust model was extended
to cover the area of the European Hilatar. Simulations with various artificial
sources placed on its eastern border confirmed, together with an analysis of the
meteorological maps [8], that the fine dust detected on the particle
surfaces most probably came from the Ryn Peski desert in Kazakhstan.
In that area the annual precipitation amounts to 100-200 mm, summers are
hot and evaporation is strong [9]. In 2001, the precipitation in July-September
was 25-50% of the average for the 1961-1990 reference period, and in
August 2001 below 25% [10]. The area had been suffering from drought. The 16th
of September was also rather windy.

The suspected Ryn Peski desert is located outside the simulation area of the
European Hilatar, thus no real simulation results could be presented in [7]. Because
the warm air front also contained Saharan air, circulated northwards
over Turkey, it was necessary to extend the model eastwards and southwards to
cover the whole HIRLAM forecast area, to add a dust emission module to the dust
model, and to assess whether it is possible to describe such a large-scale phenomenon
with a 3D grid model.

4 Dust Emission from Deserts 

The total flux of uplifted dust ($F_a$) with particle radii in the range 0.1-30 µm
was estimated in [11, 12] to be $F_a = 5.2 \cdot 10^{[\ldots]}\, u_*^{[\ldots]}$ if $u_* > u_{*t}$, and $F_a = 0$ otherwise, where
$u_{*t}$ is the threshold friction velocity, below which there is no flux: $u_{*t}$ = 60
cm s$^{-1}$ in the Sahara, and 50 / 45 / 35 cm s$^{-1}$ in the Gobi, Sand and Loess deserts in
China. A further developed form was applied in [13]:

$$F_a = K \cdot \frac{\rho}{g} \cdot u_* \left(u_*^2 - u_{*t}^2\right),$$

where $\rho$ is the air density, $g$ the acceleration due to gravity, and $K$ a surface-dependent
semi-empirical coefficient with unit m$^{-1}$. Dust emission is assumed to occur
only in desert and/or barren areas with no vegetation. A vegetation cover reduces
the dust emission flux.
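
The threshold behaviour of such an emission scheme can be sketched in a few lines. The code below assumes the cubic form $F_a = K \rho / g \cdot u_* (u_*^2 - u_{*t}^2)$, which is the form consistent with a coefficient $K$ in m$^{-1}$. The value of `K_DEMO` and the default air density are hypothetical, chosen only for illustration, and vegetation effects are ignored.

```python
def dust_flux(u_star, u_star_t, K, rho=1.2, g=9.81):
    """Vertical dust flux K * rho / g * u* * (u*^2 - u*t^2) for u* > u*t, else 0.

    u_star, u_star_t in m/s; K in 1/m; rho in kg/m^3; g in m/s^2.
    The result is in kg m^-2 s^-1.
    """
    if u_star <= u_star_t:
        return 0.0
    return K * rho / g * u_star * (u_star ** 2 - u_star_t ** 2)

K_DEMO = 1.0e-5  # hypothetical coefficient, for illustration only
for u in (0.30, 0.35, 0.45, 0.55):
    # 0.35 m/s is the lowest of the threshold values quoted in the text
    print(u, dust_flux(u, u_star_t=0.35, K=K_DEMO))
```

Because the flux grows roughly with the cube of the friction velocity above the threshold, a few windy hours can dominate the monthly emission total.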

5 Simulations with the Extended Model 

Because the HIRLAM surface classification does not contain deserts, the first emission
formula was selected, and the desert area was approximated from old maps to
cover around 24 HIRLAM grid cells in Kazakhstan. Over this area, $u_*$ exceeded the
critical limits of 35/45/55 cm s$^{-1}$ for 38.9/19.4/8.5% of the time in
2001. If we take the lowest $u_*$ threshold, the average emission intensity of the
24 grid cells covering the Ryn Peski desert (in g m$^{-2}$ s$^{-1}$) in September is
that presented in Fig. 1.

Fig. 1. Dust emission intensity, in g m$^{-2}$ s$^{-1}$, in September 2001



In Fig. 4 the surface pressure, flow direction and friction velocity are presented
over the new model area on the 17th of September, 2001, when the
friction velocity over the desert area exceeded 50 cm s$^{-1}$. The forecast surface
wind velocity on the 16th of September only just exceeded 11 m s$^{-1}$, which is
nevertheless high enough to lift dust from the ground. The dust concentrations in air
in the third model layer (at a height of around 335 m) during the episode, simulated with
the European Hilatar model with extended coverage (Fig. 2), show that the local
Estonian dust sources only partly caused the episode. There was evidently a long-range
transport episode from more distant areas. Most of the fine particles were
transported in model layers 3 and 4, and were mixed with the Estonian dust and
afterwards down to the surface over convective areas. The pressure maps support the
simulation results. They show clearly that the fine dust came only from Ryn Peski in
Kazakhstan; the deserts east of it could not affect the long-range transport
(LRT) event in Scandinavia.

The annual dust deposition in 2001 from this source area is presented in
Fig. 3. According to it, the annual dust deposition in Scandinavia from such
events is insignificant. The reason is explained by the correlation between rain
and the meteorological situations favourable for dust LRT, discussed in the next
section.

Fig. 3. Annual dust deposition, emissions from Ryn Peski

6 Frequency of Similar Events

The frequency of such LRT events was estimated from the simultaneous occurrence
of the prevalent pressure distribution and high wind speed (friction velocity)
over the desert. From Fig. 4 we can see that for such a situation there must
be a high-pressure area over the north-eastern model region (A) and low-pressure
areas over the northern Atlantic near the Norwegian coast (B) and over
Germany (C). When the pressure gradients between those areas were picked from
the meteorological database, p(A-C) and p(A-D) were both positive by more than 5 mb
for 23% of the year. The average friction velocity over the Ryn Peski
desert simultaneously exceeded 35/45/55 cm s$^{-1}$ for 9.3/5.1/2.4% of the year.
This means that the correlation of high wind velocity with the south-eastern
flow direction toward the Baltic Sea area must be rather high; in 40% of those
events, we might have Kazakhstan dust in the air.
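
The 40% figure follows from dividing the joint frequency by the frequency of the favourable pressure pattern. A minimal sketch of that arithmetic, using only the percentages quoted above (the variable and function names are hypothetical):

```python
favourable_pressure = 23.0           # % of the year with both gradients above 5 mb
joint = {35: 9.3, 45: 5.1, 55: 2.4}  # % of the year: pattern present AND u* above threshold (cm/s)

def conditional_share(joint_pct, condition_pct):
    """Share of the favourable-pressure time in which u* also exceeds the threshold."""
    return joint_pct / condition_pct

for threshold, pct in joint.items():
    share = conditional_share(pct, favourable_pressure)
    print(threshold, "cm/s:", round(100.0 * share, 1), "%")
```

For the lowest threshold, 35 cm s$^{-1}$, the share is 9.3/23, or about 40%; for the higher thresholds it drops accordingly.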

The correlation of a positive pressure gradient (A-C) with precipitation in
Ryn Peski in 2001 was -0.05. Thus, in weather types favourable for westward dust
LRT from Kazakhstan, it is not raining in the source area (in any case,
we are dealing with a desert). The correlations of a high pressure gradient between
A-C and A-D with rain over Southern Finland were 0.11 and 0.12.
Fig. 4. Friction velocity and surface pressure on the 17th of September 2001 at 12
UTC, characteristic conditions for the episode

Fig. 5. Mean dust column burden, 1967-1988 [mg m$^{-2}$]. All sand dune areas in the
domain are included in the emission calculations



7 Frequency of the Pollution Events in Scandinavia
Caused by Caspian and Saharan Dust Storms

Apart from the analysis of the episode, the current study targeted a more general
question: how frequent and how severe are dust pollution events in Scandinavia
from a climatological point of view?

To approach this task, the dispersion model DMAT [14, 15] was run for a
period of 22 years, from 1967 to 1988, for the Northern Hemisphere, with ~150
km horizontal resolution in a northern polar stereographic projection (Fig. 5). The grid
size was 99 x 99 cells. The dust elevation by strong winds was computed following
the semi-empirical algorithm of [16] as described in [14].

Fig. 6. Daily mean of the dust column burden over Scandinavia. Average over the enlarged
area shown in Fig. 5

Fig. 7. Histogram of the daily dust burden over Scandinavia. Average over the enlarged
area shown in Fig. 5

Two sets of runs were made for two different emission sources. In the
first setup, the model considered only sand in the Caspian region as a potential
source. For the second run, all other sand dune areas, except the Caspian region,
were included. Due to the linearity of the problem, the total dust pollution in the
domain is the sum of the obtained patterns (shown in Fig. 5).

As seen from Fig. 5, the mean dust burden in Scandinavia is negligible
compared to that in the southern part of Europe.

However, Fig. 6 and Fig. 7 suggest that the number of days with a recognizable
amount of Saharan or Caspian dust in the air over Scandinavia can be significant.
Assuming the length of a typical dusty episode to be three days, one finds
that there is more than one case per month when the vertically integrated burden
of Caspian dust is over 1 mg m$^{-2}$. This amount is usually spread over several
kilometers of altitude, so that the typical dust concentration during such episodes
is a fraction of a µg m$^{-3}$. For Saharan dust, the frequency of the pollution events
is about one order of magnitude smaller.
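
The step from a column burden to a typical concentration is a simple division by the depth of the dust layer. The sketch below assumes the dust is mixed uniformly over the layer; the depths of 2, 3 and 5 km are illustrative choices for "several kilometers" and do not come from the model output.

```python
def mean_concentration_ug_m3(burden_mg_m2, layer_depth_m):
    """Average concentration (ug/m^3) of a column burden (mg/m^2) mixed over a layer."""
    return burden_mg_m2 * 1000.0 / layer_depth_m

for depth_km in (2, 3, 5):
    print(depth_km, "km:", round(mean_concentration_ug_m3(1.0, depth_km * 1000.0), 2))
```

A burden of 1 mg m$^{-2}$ mixed over 2 km gives 0.5 µg m$^{-3}$, and deeper layers give less, hence "a fraction" of a µg m$^{-3}$.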

Strong dust pollution episodes (dust column burden > 1000 mg m$^{-2}$) happened
in Scandinavia just a few times during the 22 years (Fig. 6). It is also seen
that there is a strong seasonality of the events, which corresponds to the seasonality
of the dust storms, concentrated in the cold part of the year and weak in the
summer months [14].

8 Conclusions 

In September 2001, when high dust concentrations were detected at several monitoring
stations in Scandinavia, there was evidently a long-range transport episode
originating from Kazakhstan. Most of the fine particles were transported in
model layers 3 and 4, and were mixed with the plumes of local dust sources around
the Gulf of Finland, and afterwards down to the surface over convective areas. The
frequency of meteorological situations in which LRT events from eastern deserts
to Scandinavia might occur is rather high, and the concentrations might have
an effect on the surface air quality. During those situations there is seldom rain,
and the deposition levels will be low. However, this conclusion is based on the
analysis of the meteorological situations of only one year.

Long-term analysis of the frequency of dust episodes in Scandinavia suggests
that moderate events appear quite frequently, primarily originating from
the Caspian sand deserts. However, really severe episodes with
degraded visibility and other features of heavy dust pollution are very rare: a few
cases during the considered 22-year period. The distribution of the mean load is, as
can be expected, considerably non-homogeneous: the southern regions of Scandinavia
get about 10 times the load of the northern ones.
1 Introduction



Simulation of atmospheric pollution and air quality requires skills in various
domains such as meteorology, chemistry, fluid mechanics and turbulence. The
numerical simulation of all of these takes a long CPU time and needs large
computational resources on current machines. In many cases, it is a common
feature of Atmospheric Quality Models (AQMs) that the chemistry part has
the highest computational demand relative to the other parts, due to the large
number of species and reaction rates which have to be taken into account. For
photochemistry and ozone production, reduced chemical mechanisms
use several tens of species and reactions, for example the Carbon Bond
IV mechanism (33 species and 81 reactions) [6]. However, as noted by Elbern [4], the
chemistry component of the simulation, which consumes the most
computational resources, must be calculated at each grid point but does
not require any intermediate communication. This suggests the use of parallel
computing, since the arithmetic-to-communication ratio is usually high for such
applications. Studies have already been performed to corroborate the suitability
of massively parallel processing [2], but very often with specific machines devoted
to parallel computations.

In this work, our aim is to quantify the performance of PC clusters for these
applications, specifically in our case for urban air quality assessment. In all laboratories
and offices, it is easy to have PC clusters connected by Ethernet networks. Nowadays,
in large institutions, these platforms may exceed several tens of nodes, but
this is theoretical and generally fewer nodes are available, due to the traffic load on the
networks. To emphasize the promise of such local platforms, we have carried out
simulations of different atmospheric pollution scenarios on a small PC cluster
with a maximum of 9 nodes (network: Ethernet; processor: INTEL3, SOOMhz,
512 MB). We first give a short description of our air quality model, then
we present the parallelisation technique we chose, and finally efficiency
results for inert and chemical cases are shown before concluding.
2 The Eulerian Air Quality Model

The Eulerian model is represented by the following system of partial differential
equations:

$$\frac{\partial C_i}{\partial t} + \operatorname{div}\bigl(V(x,t)\, C_i\bigr) = \operatorname{div}\bigl(D(x,t)\, \nabla C_i\bigr) + R_i + S_i,$$

where $x$ represents the co-ordinates in 3D space, $C_i$ is the concentration
of pollutant $i$, and $V(x,t)$ is the wind velocity in the horizontal direction. $D(x,t)$ is
the turbulent diffusion coefficient; it has been either kept constant or calculated using
the Louis 1D parameterisation [7]. The term $R_i$ is the chemical reaction rate of
species $i$,

$$R_i = P_i - L_i\, C,$$

where $P_i$ is the production term and $L_i C$ is the loss term ($L_i$ is a diagonal
matrix). For a species $i$, $S_i$ can be written in the form

$$S_i = E_i(x,t) - v_{\mathrm{dep},i}\, C_i,$$

where $E_i$ represents the emission source and $v_{\mathrm{dep},i}$ the dry deposition velocity.

Appropriate discretization procedures (finite difference techniques in Fortran 90)
have been applied to the system of partial differential equations. One
particular feature of our model is the ability to refine the mesh in some parts of the
computational domain [9]; however, this was not used in the present study.

The time-splitting technique is applied, as it is a methodology generally used in
AQMs [3, 14]. The integrations are performed sequentially using a transport time
step of several tens of seconds. The chemical term is treated numerically by
the QSSA [13] or the two-step algorithm [10, 12], which are suitable for stiff systems
of ordinary differential equations. The chemistry time step is a multiple of the
transport time step, as shown by the schematic view of the code structure in
Fig. 1.
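
The structure of such a split time loop can be sketched as follows. This is an illustration only, under several assumptions: a single species on a periodic 1D grid, first-order upwind advection standing in for the model's advection and diffusion step, constant production and loss rates, the exponential QSSA update in its textbook form (not necessarily the variant of [13]), and a chemistry step that is sub-cycled within each transport step, as the "time cycle on chemistry time step" box of Fig. 1 suggests. All function names and numerical values are hypothetical.

```python
import numpy as np

def transport_step(c, courant):
    """First-order upwind advection on a periodic 1D grid; needs 0 <= courant <= 1."""
    return c - courant * (c - np.roll(c, 1))

def qssa_step(c, production, loss, dt):
    """Exponential QSSA update for dc/dt = P - L*c with P and L frozen over dt."""
    c_eq = production / loss
    return c_eq + (c - c_eq) * np.exp(-loss * dt)

def split_step(c, courant, production, loss, dt_transport, n_chem):
    """One sequential splitting step: transport, then n_chem chemistry sub-steps."""
    c = transport_step(c, courant)
    dt_chem = dt_transport / n_chem
    for _ in range(n_chem):
        c = qssa_step(c, production, loss, dt_chem)
    return c

c = np.zeros(50)
c[10:15] = 1.0                       # initial puff, arbitrary units
for _ in range(200):
    c = split_step(c, courant=0.5, production=2.0e-3, loss=1.0e-2,
                   dt_transport=30.0, n_chem=5)
print(c.min(), c.max())
```

With constant rates the chemistry relaxes every cell towards the equilibrium value P/L, while the transport step only redistributes the tracer; in a real mechanism P and L depend on the other species and are re-evaluated at every chemistry sub-step.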



3 Model Parallelisation 

Two methodologies have been proposed to achieve an efficient
parallelisation: each species can be entrusted to one processor or, alternatively,
grid partitioning (domain decomposition) can be used. The first approach may
lead to different computational loads on the nodes, since some species vary faster than
others, which degrades the parallel efficiency. Therefore, we chose domain
decomposition, which is based on partitioning the domain into several sub-domains.

With this methodology, one can use partial parallelisation schemes, for example
parallelising only the chemistry [5]. At first sight, this strategy seems interesting since
the chemical part needs no communication. With such an approach, however, the speed-up
may be low, as explained below. In the present work, we prefer parallelisation
of the entire coupled model (transport and chemistry), and since we used the
Message Passing Interface (MPI) library, it needs minimal modification of the
original serial code.

[Flow chart: Initial stage (input data, calculations which do not depend on time); Time loop (advection + diffusion time step with communication needed to update boundaries; time cycle on chemistry time step); Final stage (output data).]

Fig. 1. Schematic view of the code structure and algorithm for one processor. The
big arrow on the right side indicates communications with neighbour nodes during the
time loop; the single-line arrows are for communications between all nodes before the
time loop.

The advection and diffusion time step uses a second-order scheme in space, so some
communication between neighbouring processors is necessary after the chemistry
part of the code, as seen in Fig. 1. If only the chemistry part of the code were
parallelised, the algorithm would be different from the one shown in Fig. 1: during the
time loop, only one processor would be entrusted with the transport calculation, leading
to massive communication with all the processors at the beginning and at the end
of the chemistry calculations. This step would thus act as a 'bottleneck', degrading the
parallel efficiency.

In all the numerical experiments reported here, we have assigned a domain
of equal size to each processor.
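
A block partition of this kind can be sketched as below. The code assumes a one-dimensional strip decomposition along a single horizontal axis, which is an assumption: the text does not say how the sub-domains are shaped. The function name `partition` is hypothetical.

```python
def partition(n_cells, n_procs):
    """Split range(n_cells) into n_procs contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n_cells, n_procs)
    blocks = []
    start = 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        blocks.append((start, start + size))   # half-open index range [start, end)
        start += size
    return blocks

print(partition(40, 8))   # 40 columns divide evenly over 8 processes
print(partition(40, 9))   # with 9 processes the blocks cannot all be equal
```

When the number of columns is not a multiple of the number of processes, the largest block determines the longest CPU time, which is one reason why the measured efficiency need not vary smoothly with the number of nodes.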

The parallel performance is evaluated in terms of the efficiency $E(p) = S(p)/p$,
where $p$ is the number of nodes engaged and $S(p)$ is the speed-up,

$$S(p) = \frac{T(1)}{[T(p)]^{\max}},$$

$T(1)$ being the time for a sequential execution and $[T(p)]^{\max}$ the longest CPU
time among all the processors. However, the restitution time, which is a more
'user-oriented' parameter than the efficiency, is also an
important issue which has to be addressed.
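
These two measures translate directly into code. In the sketch below the timings are hypothetical values chosen for illustration; they are not measurements from the cluster.

```python
def speed_up(t_serial, t_per_proc):
    """S(p) = T(1) / (longest CPU time among the p processors)."""
    return t_serial / max(t_per_proc)

def efficiency(t_serial, t_per_proc):
    """E(p) = S(p) / p, with p the number of processors used."""
    return speed_up(t_serial, t_per_proc) / len(t_per_proc)

# Hypothetical timings in seconds, for illustration only.
t1 = 120.0
t4 = [38.0, 40.0, 36.5, 39.0]
print(speed_up(t1, t4), efficiency(t1, t4))
```

Using the slowest processor in the denominator means that any load imbalance between sub-domains lowers the reported efficiency, even if the other processors finish early.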



4 Results 

To test the performance of parallel computation, we have carried out simulations 
of different air pollution scenarios with one to 8 or 9 nodes (network: Ethernet; 
processor: INTEL3, SOOMhz, 512MB). 
4.1 Case with No Chemical Transformations

A non-reactive species (NOx, for example) is emitted from the ground by an
urban area, within a domain represented by 40*38*20 = 30400 grid points. The
horizontal and vertical mesh sizes are 1 km and 50 m, respectively. The wind
field is provided by ALADIN data from Meteo-France; it is not uniform over the
domain. The diffusion coefficient is constant and equal to 10 m$^2$ s$^{-1}$.



Fig. 2. Efficiency of the parallel computations as a function of the number of nodes for
the case of a non-reactive species. (Plot title: "Case with no chemical transformations"; horizontal axis: number of processors.)



Figure 2 shows that the efficiency decreases quickly, to under 50%, when the number
of nodes is greater than 3. This result was expected, owing to the communication
overhead. Without chemistry, the pollutant transport calculations are light and the
arithmetic-to-communication ratio is low. However, we have to remember that the
restitution time is still reduced, although not by much (a factor of 2.4). If the
number of grid points increases, through a bigger domain or mesh refinement, the
efficiency should be better; on the other hand, increasing the number of nodes
will not improve the efficiency. Thus we think the present case, with several
tens of thousands of points, is the lower limit below which parallelising the
transport is neither efficient nor useful. Figure 2 shows that the efficiency is
about 80% for a few nodes (fewer than 3). Neglecting the overhead due to the initial
stage, this indicates a minimum loss of efficiency of about 20% due to the transport
time step. With chemistry, this 'incompressible' loss cannot be avoided: we expect an
efficiency improvement when the number of nodes is greater than 3, but efficiency
values will not be greater than 80%.
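The drop in the arithmetic-to-communication ratio can be illustrated with a simple counting model. The sketch below assumes a one-dimensional split of the 40 × 38 × 20 grid into slabs of vertical planes, with one layer of ghost cells exchanged on each internal face; this is an illustrative assumption, not a description of the partitioning or communication scheme of the code used in the paper.

```python
def slab_ratio(nx, ny, nz, p):
    """Interior-to-halo cell ratio for the widest slab of a 1-D split along x.

    Assumes one layer of ghost cells per internal face (illustrative model).
    """
    planes = -(-nx // p)          # ceil(nx / p): planes in the widest slab
    if p == 1:
        faces = 0                 # no neighbour, nothing to exchange
    elif p == 2:
        faces = 1                 # each slab has a single neighbour
    else:
        faces = 2                 # an inner slab has two neighbours
    interior = planes * ny * nz   # cells updated by the processor
    halo = faces * ny * nz        # cells exchanged at every transport step
    return float("inf") if halo == 0 else interior / halo


for p in (2, 4, 8):
    print(p, slab_ratio(40, 38, 20, p))
```

In this model the work per exchanged cell falls in proportion to the slab width, which is consistent with the rapid loss of efficiency observed when no chemistry is computed on the interior cells.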



4.2 Cases with Chemical Transformations 

Two cases of photochemical episodes over urban areas have been studied, one
with a simple chemical scheme (9 reactions and 9 species) [8], and one with a
more realistic mechanism (121 reactions and 63 species) [1].

Simple Chemistry. This first case uses a domain of 50 × 55 × 20 nodes, and the
emissions are confined to a linear source as shown in Fig. 3. The wind speed is
uniform and kept constant at 2 m.s⁻¹; the diffusion coefficient and temperature vary
with height and have been determined with a 1D model of the atmospheric
boundary layer [11]. The actinic flux also varies during the day, which allows this
numerical experiment to take account of the variation of the computational burden
at sunrise and sunset.

Fig. 3. Description of the scenario using a simple chemical scheme.



With this simple chemical mechanism, we can see that the additional arithmetic
calculations provide an increase of the efficiency for 6 nodes. This is an improvement
compared to the previous inert case. Above seven nodes, too many communications
are made in the transport step of the code, decreasing the efficiency below
50%, and a grid partitioning with a greater number of blocks is not recommended.

Detailed Chemistry. In the second case, we use a grid of similar size,
40 × 48 × 20 nodes. However, the chemistry is more detailed (121 reactions and 63
species). This numerical experiment corresponds to the case of an urban plume
idealized by three nested emission zones. The urban area is defined by a square
surrounded by a regional zone, which is included in a rural zone (Fig. 5). All
the emissions follow a daily evolution, whose average values are given in Table 1.
The wind speed is uniform and kept constant at 2 m.s⁻¹. As in the previous
case, the diffusion coefficient, temperature and humidity are also functions of
time. This kind of simulation is representative of a real case of photochemical
pollution episode occurring over Paris in June 1995.

Table 1. Emissions (molec.cm⁻².s⁻¹) used for the urban scenario with detailed chemistry.
(10^?: exponent illegible in the source; -: no value in the source.)

Species                      Urban emissions   Regional emissions   Rural emissions
Anthropogenic NOx            2.52 × 10^?       (illegible)          0.008 × 10^?
Anthropogenic NMHCs          12.3 × 10^?       0.69 × 10^?          0.41 × 10^?
Biogenic NOx                 -                 1.2 × 10^?           4 × 10^?
Biogenic NMHCs (isoprene)    1.5 × 10^?        5.13 × 10^?          14 × 10^?
CO                           8.35 × 10^?       5 × 10^?             -
SO2                          2.1 × 10^?        2.1 × 10^?           3.3 × 10^?
CH4                          6 × 10^?          6 × 10^?             6 × 10^?

[Figure: "Case with a simple chemical mechanism"]

Fig. 4. Efficiency of the parallel computations as a function of the number of nodes for
the case of a simple chemical mechanism.

[Figure: nested emission zones (urban zone, regional or peri-urban zone, rural zone);
domain dimensions 200 km and 300 km]

Fig. 5. Layout of the urban scenario with detailed chemistry.

Figure 6 shows that the efficiency remains greater than 50% with eight processors.
So this numerical experiment, which is representative of simulations carried out
for ozone impact assessment over large urban areas, shows that a small cluster
(ten processors), like the one placed at our disposal, is sufficient to speed up the
calculations and to reduce the computing time suitably. For example, with eight
nodes, an efficiency of about 50% implies a restitution time which is reduced 4
times compared to a sequential run.
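The factor of 4 quoted for eight nodes follows directly if the efficiency is taken to be the speed-up divided by the number of nodes, E(p) = S(p)/p. This relation is the usual definition and is assumed here, since it is not restated in this section:

```latex
S(8) = p \, E(p) = 8 \times 0.5 = 4,
\qquad
[T(8)]^{\max} = \frac{T(1)}{S(8)} = \frac{T(1)}{4}.
```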

[Figure: "Urban scenario with a detailed chemistry"]

Fig. 6. Efficiency of the parallel computations as a function of the number of nodes for
the urban scenario with a detailed chemistry.

In Fig. 6, the different results for a given number of nodes correspond to
simulations obtained with different domain decompositions. As indicated previously,
we have attributed a domain of equal size to each processor. The different
results in Fig. 6 have been obtained by adding or subtracting only one vertical
plane of meshes to the processors which have the biggest calculation load. These
are located downwind and include a part of the urban plume (or all of it,
depending on the number of nodes). From Fig. 6, it is obvious that load balancing
is important for these particular nodes, and this needs further work.
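The effect of moving a single vertical plane can be sketched with a toy cost model. The per-plane costs below are hypothetical (planes inside the plume are simply assumed to cost twice as much as the others), and the functions `loads` and `imbalance` are illustrative helpers, not part of the code used in the paper.

```python
def loads(plane_costs, counts):
    """Total cost of each contiguous block holding counts[i] vertical planes."""
    assert sum(counts) == len(plane_costs)
    out, start = [], 0
    for n in counts:
        out.append(sum(plane_costs[start:start + n]))
        start += n
    return out


def imbalance(block_loads):
    """Maximum load divided by mean load (1.0 means perfectly balanced)."""
    return max(block_loads) / (sum(block_loads) / len(block_loads))


# Hypothetical costs for 40 vertical planes: the last 10 planes (downwind,
# inside the plume) are assumed to be twice as expensive as the others.
plane_costs = [1.0] * 30 + [2.0] * 10

equal = loads(plane_costs, [10, 10, 10, 10])    # equal-size domains
shifted = loads(plane_costs, [10, 10, 11, 9])   # one plane taken from the plume block

print(equal, imbalance(equal))
print(shifted, imbalance(shifted))
```

In this toy case, removing one plane from the most loaded processor lowers the maximum load from 20 to 18 cost units, so the slowest processor, which sets the restitution time, finishes earlier.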

5 Conclusion 

We performed several numerical experiments on a small PC cluster connected by
an Ethernet network. These simulations are representative of urban pollution
impact calculations, meaning that the grid size is several tens of thousands of points.
For such cases, the efficiency of parallel computing is low without chemistry
and high when chemistry has to be taken into account.

However, further work must be done on load balancing, which has not
been studied extensively in the present work. We noted, though, that the grid
partitioning has a great influence, mainly for processors dealing with the urban
plume.

Abstract. Scientific interest in improving air quality tools that are
able to provide a detailed, source-oriented description of the
atmospheric processes and of the impact of specific large industrial emission
points is growing. This contribution shows the results of using a
state-of-the-art third-generation air quality modelling system (MM5-CMAQ)
over a pilot industrial test site in the Madrid mesoscale domain. The MM5
model is a well-known non-hydrostatic mesoscale meteorological model
developed by PSU/NCAR, and CMAQ (Community Multiscale Air
Quality Modelling System) is a third-generation air quality dispersion
model developed by the EPA (U.S.). Both are modular codes and include
an interface called MCIP which assures full consistency of the numerical
methods in the advection and diffusion schemes between the
meteorological module and the chemical and dispersion module. The
application has been done with the CBM-IV chemical scheme (simple version).
The results of running the so-called ON/OFF approach, in other words
the differences between the simulation with full industrial emissions and
the one with no industrial emissions, show that the MM5-CMAQ modelling system
can be used as an excellent tool for evaluating the impact of different
emission sources in an operational mode. Results also show that increases of
up to 50% on average, and larger values for hourly impacts, are found for ozone
concentrations in the areas surrounding the industrial plant.
In addition, we see that the distance over which the industrial emissions
affect air quality can be larger than 40 km.



1 Introduction 

The air quality impact of industrial plants has been a key issue in air quality
assessment and modelling since the 1970s. Nowadays, increased computer power
and progress in air pollution science provide a powerful and
reliable tool to measure the air quality impact of industrial emissions. In the
last decade a considerable effort has been made to incorporate industrial production
processes into an integrated environmental evaluation in on-line mode.
In parallel, a considerable increase in citizen concern has been detected,
particularly in those areas where important industrial point sources are present
together with highly populated (urban) areas. This is particularly
sensitive for refineries, municipal waste incinerators, etc.

In this contribution we show the results of a preliminary modelling
experiment to build the so-called TEAP tool (a tool to evaluate the air quality
impact of industrial emissions). This tool is designed to be used by the
environmental impact department at the industrial site. The tool reports the
air quality impact of industrial emissions in the form of surface patterns and
linear time series for specific geographical locations in the model domain. The
model domain is designed so that the industrial point source is located
approximately at its center. The model domain can be as
extensive as desired, but a specific nesting architecture should be designed for each
case, together with a balanced computer architecture.

The TEAP tool (a EUREKA-EU project) has the capability to incorporate
different modelling systems. In a preliminary stage we tested the system
with the so-called OPANA model. OPANA stands for Operational Atmospheric
Numerical pollution model for urban and regional Areas; it was developed
in the mid-1990s by the Environmental Software and Modelling
Group at the Computer Science School of the Technical University of Madrid
(UPM). It is based on the MEMO model, developed at the University of Karlsruhe
(Germany) in 1989 and updated in 1995, for non-hydrostatic three-dimensional
mesoscale meteorological modelling, and on the SMVGEAR model for chemical
transformations, which is based on the CBM-IV mechanism and the GEAR implicit
numerical technique developed at the University of California, Los Angeles (USA)
in 1994. Different versions of the OPANA model have been used for simulating the
atmospheric flow and the pollutant concentrations over cities and regions in
different EU-funded projects such as EMMA (1996-1998), EQUAL (1998-2001) and
APNEE (2000-2001). In these cases and others the model has become an operational
tool for several cities and regions such as Leicester (United Kingdom), Bilbao
(Spain), Madrid (Spain), the Asturias region (north of Spain) and Quito (Ecuador,
BID, 2000). In all these cases the model continues to operate on a daily basis and
simulates the atmospheric flow in a three-dimensional framework. The OPANA model,
however, is a limited-area model, which means that the model domain is limited by
the earth's curvature, and cloud chemistry and particulate matter (aerosol
and aqueous chemistry) are not included.

In this contribution we use the MM5-CMAQ modelling system. MM5-CMAQ
is a representative of the latest generation of AQMS (third generation of
Air Quality Modelling Systems), developed by the EPA (USA) in 2000. The model
uses a fully modular structure with the latest advances in computer programming
(FORTRAN-95). In essence many of the features of MM5-CMAQ are similar to those of
OPANA, but the programming and modularity are more advanced. MM5-CMAQ
is not a limited-area model and it can run over large domains (even at global level,
although a global version of CMAQ does not exist yet). The model domains are
obviously closely related to the model forecast horizon, so that the nesting capability
(as was also the case in OPANA) plays an essential role in obtaining
reliable simulations over city and regional domains. Other representatives of the
third generation of AQMS are CAMx (Environ Co., USA) and EURAD (European
Ford Research Group and University of Cologne, Germany). MM5 is the
well-known Non-hydrostatic Mesoscale Meteorological Model developed at
Pennsylvania State University and NCAR starting in 1983. MM5 is today
one of the most robust and reliable meteorological models. In both cases, and in
all Eulerian models, the input datasets are key elements for obtaining
reliable and accurate simulations. These datasets are: the DEM (Digital Elevation
Model); land use data (usually satellite data: AVHRR/NOAA, Landsat, Spot,
etc.); initial and boundary meteorological conditions; initial and boundary air
concentration profiles; and finally, emission datasets. The emission datasets are
usually the bottleneck in this type of application, since the uncertainty
involved is important. In spite of this limitation, the TEAP tool extracts its most
important benefits from the relative difference between a simulation with the full
emission inventory and a second simulation with an emission inventory without
the industrial plant to be studied.



2 Experiment 

We have implemented the MM5-CMAQ modelling system in a nesting architecture.
The MM5 Mesoscale Meteorological Model (PSU/NCAR) and the Community
Multiscale Air Quality Model (CMAQ) [1] from the EPA (USA) (third generation
of air quality modelling systems) are used as the main platform. MM5 is
built over a mother domain with 36 x 36 grid cells (81 km spatial resolution)
and 23 vertical levels. This makes a domain of 2916 x 2916 km. The
MM5 nesting level 1 model domain is built over 69 x 66 grid cells (27 km
spatial resolution) and 23 vertical levels, which makes a model domain of 1863 x
1782 km centered over the Iberian Peninsula. The CMAQ model domains are 30 x 30
grid cells for the mother domain and 63 x 60 for the nesting level 1 model domain.
The lower left corner of the CMAQ mother domain is located at (-1215000 m, -1215000
m), with the reference location (-3.5W, 40N) and the first and second standard
parallels (30N, 60N). The lower left corner of CMAQ nesting level 1 is located at
(-891000, -810000) with the same reference location. The 9 km MM5 model domain
has 54 x 54 grid cells, the 3 km MM5 model domain has 33 x 39 grid cells and,
finally, the 1 km MM5 model domain has 30 x 30 grid cells. The corresponding
CMAQ model domains are: 48 x 48 grid cells, reference (-216000, -216000) in
Lambert Conformal projection, with 9 km spatial resolution; 27 x 33 grid cells,
reference (-54000, -9000), with 3 km spatial resolution; and finally, 24 x 24 grid
cells, reference (-27000, 33000), with 1 km spatial resolution. In this contribution
we show results for the 3 km spatial resolution (nesting level 3) only. Figures 1
and 2 show the different CMAQ domains (mother and nesting levels).
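The domain sizes and corner coordinates quoted above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the extents from the cell counts and resolutions given in the text, and expresses the lower-left corner of CMAQ nesting level 1 as an offset in mother-domain cells. The function names are hypothetical and are not part of MM5 or CMAQ.

```python
def extent_km(ncols, nrows, res_km):
    """Domain size in km from the number of cells and the cell size."""
    return ncols * res_km, nrows * res_km


def offset_in_parent_cells(child_ll_m, parent_ll_m, parent_res_m):
    """Offset of a nest's lower-left corner from its parent's, in parent cells."""
    dx = child_ll_m[0] - parent_ll_m[0]
    dy = child_ll_m[1] - parent_ll_m[1]
    return dx / parent_res_m, dy / parent_res_m


print(extent_km(36, 36, 81))   # MM5 mother domain
print(extent_km(69, 66, 27))   # MM5 nesting level 1
print(extent_km(27, 33, 3))    # CMAQ nesting level 3

# CMAQ mother domain (81 km cells) and nesting level 1, coordinates in metres.
mother_ll = (-1215000, -1215000)
nest1_ll = (-891000, -810000)
print(offset_in_parent_cells(nest1_ll, mother_ll, 81000))
```

A whole-number offset indicates that the nest corner falls on a cell boundary of the parent grid, which is what one would expect for a consistently nested configuration.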

Fig. 1. Mother and nesting level 1 domains in this experiment for the MM5-CMAQ
modelling system.

The industrial plant is located at (.6727.0, 62909.65) in Lambert Conformal
coordinates. This industrial plant emits 340 Tn/year of SO2, 155 Tn/year of NOx
and 2.9 Tn/year of VOCs. We have selected, as a preliminary test, the period
of February 4-8, 2002. The emission database is obtained from the EMIMA
model [2] and the EMIMO model, a new emission model for large domains based
on global emission inventories such as GEIA, EMEP, EDGAR and the Digital
Chart of the World. MM5-CMAQ is initialised using global data sets from
MRF (NOAA/NCEP, USA), and the mother domain and nesting levels 1 and 2 provide the
boundary conditions for running MM5-CMAQ for nesting level 3 over the Madrid
Community area with 27 x 33 grid cells (3 km), which makes 81 x 99 km. The
OPANA model runs over a domain of 80 x 100 km in UTM coordinates,
using only EMIMA data sets. EMIMA data sets are also used for MM5-CMAQ
over nesting level 3. The EMIMO data set is used for the mother domain and nesting
levels 1 and 2 for MM5-CMAQ. We have performed two MM5-CMAQ simulations, one
with the industrial plant emissions and the second one without the
industrial plant emissions.
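The comparison of the two runs can be sketched as a cell-by-cell relative difference. The normalisation by the OFF run is an assumption made for illustration, since the exact formula is not given in the text, and the small fields below are hypothetical values rather than model output.

```python
def relative_impact_percent(on, off):
    """Cell-by-cell relative difference 100 * (ON - OFF) / OFF, in percent."""
    return [[100.0 * (a - b) / b for a, b in zip(row_on, row_off)]
            for row_on, row_off in zip(on, off)]


# Hypothetical 2 x 2 ozone fields in arbitrary units (illustration only).
ozone_on = [[30.0, 45.0], [60.0, 50.0]]    # run with the industrial plant
ozone_off = [[60.0, 50.0], [60.0, 40.0]]   # run without the industrial plant

impact = relative_impact_percent(ozone_on, ozone_off)
for row in impact:
    print(row)
```

Negative values mark cells where the plant emissions lower the ozone concentration, positive values cells where they raise it.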

3 Results 

The results show important differences in relative values for ozone concentrations
over the model domain. The impact of industrial emissions on ozone concentrations
is distributed along the North-East wind direction on average over the 120-hour
simulation period. In the immediate surroundings of the industrial
plant, the NOx and VOC emissions can reduce ozone concentrations by up to
40-50% on "average" over the 120 hours.

[Figure: "Nesting level 3"; "MM5-CMAQ Process Analysis"]

Fig. 2. Nesting level 3 for the MM5-CMAQ modelling system.

Figure 3 shows the differences between ozone concentrations with and without
the industrial plant over the 3 km grid cell where the industrial plant is located,
as simulated with MM5-CMAQ. Important differences are found. The industrial emissions
decrease the ozone levels by up to 40-50% during days 3-4 of the simulation
period. An in-depth analysis of the impact of industrial emissions using process
analysis techniques is shown in Figure 4. This plot illustrates the variations in
the contributing processes during the simulation period. It shows the contributions
to ozone concentrations at the industrial grid cell (16,23) over the simulation
period (0-120 hours) from advection in the E-W and N-S directions (XYADV), vertical
advection (ZADV), mass adjustment for advection (ADJC) and chemistry (CHEM),
together with the simulated ozone concentrations (with industrial emissions).
Horizontal advection and chemistry seem to play an important role at
the industrial grid cell, in accordance with the simulated ozone concentrations
during days 3-4 of the 120-hour simulation period.

Fig. 3. MM5-CMAQ ozone concentrations at the industrial plant cell with and without
industrial emissions.



The above time series and surface patterns show that TEAP can be a relevant
software tool to help industrial plant managers integrate environmentally friendly
practices into the industrial production process and to help the city
and regional authorities enforce environmental regulations, particularly when
pollution episodes are forecast and the industrial plant plays an important
role in those episodes. TEAP can identify in a clear and systematic way the
impact of the industrial plant emissions (for the industrial plant implemented in each
specific case) over the spatial domain and over time. The spatial and
temporal location is identified immediately by the software expert system
designed in TEAP, just after both simulations (ON and OFF) have ended.

The results of this modelling experiment show that TEAP can constitute an
important tool for industrial managers in order to comply with the environmental
regulations in force in different countries. The tool has been implemented on
three PCs running Red Hat Linux, with 1 GB of RAM and a 120 GB hard
disk each. The three PCs have been connected in an Internet network to run
in operational mode and produce real-time industrial air quality impacts on a
daily basis. The software is still under further development and adaptation,
since several additional features will be implemented, focusing on those
required to avoid pollution episodes that are partially due to industrial emissions,
particularly in areas in the vicinity of industrial plants. The second phase
is planned to integrate the industrial production process, under an optimized
cost-benefit analysis, into environmentally conscious industrial management.

[Figure: "MM5-CMAQ 3 km resolution (27 x 33), 120 hours, 04-08 February 2002, cell (16,23)"]

Fig. 4. Process analysis (I) at industrial grid cell.

The surface patterns illustrated here show that differences of between 10 and 50%,
depending on the air pollution model, can be found in the area surrounding
the industrial plant. These differences are important since they generate
increases in ozone concentrations of up to 40% in the surroundings of the industrial
plant in the MM5-CMAQ model. Further analysis and more simulation periods are
required to establish more solid conclusions than those obtained in this
preliminary analysis. The TEAP-EUREKA project is intended to develop an expert
system tool to analyze a large number of simulation periods in order to establish the
calibration between monitoring data and simulated data. The results will be used
to interpret the relative differences between ON and OFF scenarios with the
corresponding error margins. The real-time results will be used by the industrial
plant managers to optimize the cost-performance-pollution impact relation
in order to quantify the impact of the different emissions on the area.

Abstract. The present work is an overall report on an extensive joint
study aimed at a detailed description and explanation of pollution
transport in the air basin over South-western Bulgaria and Northern Greece,
and at an assessment of the air pollution exchange between Bulgaria and
Greece. Some well-known specific climatic air pollution effects were
studied and explained. Calculations were made of the SO2 pollution of the
Balkan peninsula from both Greek and Bulgarian sources for 1995, and
the country-to-country pollution budget diagrams were built. Days with
extreme mean concentrations for Bulgaria and Northern Greece were
picked out, and the contribution of the different sources in both countries
to these cases of extreme pollution was further specified. Some preliminary
studies of possible mesoscale effects on the
pollution exchange between Bulgaria and Northern Greece were carried
out. A three-layer pollution transport model with a more complex
chemistry block was introduced, and some preliminary simulations of the
transport of sulfur and nitrogen compounds were performed.



1 Introduction 

The present work is a compilation of already published papers. The reason for
presenting it is to give a general view and provide an overall report on an
extensive joint study aimed at a detailed description and explanation of pollution
transport in the air basin over South-western Bulgaria and Northern Greece, and
at an assessment of the air pollution exchange between Bulgaria and Greece.

2 Research Methods and Techniques 

The air pollution studies and the subsequent assessments were based on reliable
and representative data: BREWER and DOAS air pollution data from Thessaloniki
since 1982; detailed synoptic information for the Balkan peninsula since
1994; a detailed emission inventory, based on the CORINAIR methodology, since
1990; and information from ground-based air pollution measuring stations.

The IMSM (Integrated Multi-Scale Model) [2, 7] and EMAP [6] are the main 
modeling tools used for pollution transport simulations. A simplified 3-layer 
model [5] with a more complex chemistry block and a quasi-stationary model of 
the mesoscale dynamics [1] are also applied in some of the studies. 

Extensive computer simulations were carried out and the conclusions were 
made on the basis of joint analysis of data and simulation results. 

3 Main Results 

3.1 Summer Sulfur Pollution in Thessaloniki 

A specific climatic effect was registered in the city of Thessaloniki by analysis of
the time series obtained from 15 years of Brewer spectrophotometer measurements
of columnar SO2: although the maximum concentrations of columnar SO2
originate from local sources in the winter, significant columnar SO2 amounts are
also seen in the summer, when there are no local sulfur sources. On a few
occasions, however, values as high as 5-7 m-atm-cm of columnar SO2 have been
measured in both the winter and summer periods. The nature of this phenomenon was
studied for selected episodes [9] (see Fig. 1) and for July and August 1994 [10].

Fig. 1. Evolution of the measured and simulated columnar SO2 [m-atm-cm] over
the city of Thessaloniki for two episodes. The error bars show the standard deviation
of the measured columnar SO2.



It was shown that the cases with high SO2 quantities are most often
connected with N-NE flows over the city (Table 1).

Table 1. The mean columnar SO2 in Thessaloniki (for the whole period and for the
days with different wind directions) and the corresponding relative contribution of the
days with different wind direction to the mean columnar SO2 estimated for July.

wind direction                  all    E      NE     N      NW     W      SW     S      SE
number of days                  58     3      23     7      10     9      3      1      2
mean columnar SO2 [m-atm-cm]    1.90   1.77   2.35   1.28   1.67   1.31   1.74   1.41   0.87
relative contribution [%]       100.   5.05   51.47  8.51   15.83  11.17  4.96   1.33   1.64

Table 2. Contribution [%] of different countries to the columnar SO2 in Thessaloniki
estimated for two episodes in August 1994.

Period          BG      GR      TUR    ALB    YUG     ROM     MOL    UKR
4-5 August      44.46   18.53   0.13   0      0.04    26.56   5.98   4.36
26-28 August    0.12    77.9    0      1.55   19.13   1.29    0      0

This is not a rule, however, because for some episodes the SO2 pollution peaks
may occur with W-NW flows as an impact of Greek pollution sources (see Table 2).
The quite good agreement between the experimental and simulated columnar SO2
evolution for the chosen episodes (see Fig. 1), as well as for the longer time
period of July and August 1994 [10], ensures that the evaluations made for the SO2
origin are also correct. The method of the functions of influence is applied for the
purpose. The estimations show that on the days with NE winds only a little more
than 20% of the mean columnar SO2 in Thessaloniki comes from sources within
150 km of the city (i.e. more or less local sources), while the major
part comes from sources located to the North and North-East of Thessaloniki.
For the situations with non-NE winds the contribution of the local sources
is usually much bigger: averaged over all the non-NE days it is about 50%. For
the whole of the studied period the contribution of the local sources is a little
less than 40%, with significant contributions also from sources situated to the N, NE
and SW of Thessaloniki.
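The relative contributions in Table 1 appear to be consistent with a day-weighted share, in which each wind direction contributes its number of days multiplied by its mean columnar SO2. This reading is an assumption, since the formula is not stated in the text. The sketch below recomputes the shares from the rounded values in the table.

```python
# Values copied from Table 1 (number of days and mean columnar SO2 in m-atm-cm).
days = {"E": 3, "NE": 23, "N": 7, "NW": 10, "W": 9, "SW": 3, "S": 1, "SE": 2}
mean_so2 = {"E": 1.77, "NE": 2.35, "N": 1.28, "NW": 1.67,
            "W": 1.31, "SW": 1.74, "S": 1.41, "SE": 0.87}


def relative_contributions(days, means):
    """Share (%) of each direction in the day-weighted sum (assumed formula)."""
    weighted = {d: days[d] * means[d] for d in days}
    total = sum(weighted.values())
    return {d: 100.0 * w / total for d, w in weighted.items()}


contrib = relative_contributions(days, mean_so2)
for direction, share in contrib.items():
    print(f"{direction:>2}: {share:5.2f} %")
```

With these rounded inputs the recomputed shares agree with the tabulated ones to within about 0.1 percentage point, for example roughly 51.4% against 51.47% for NE winds.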

3.2 Some Long-Term Estimations of the Bulgarian-Greek Pollution Exchange

Calculations were made of the SO2 pollution of the Balkan Peninsula from both Greek
and Bulgarian sources for 1995, and the country-to-country pollution budget
diagrams were built [8].

The spatial distribution of the annual concentration in air and in precipitation
due to Bulgarian sulfur sources for 1995 is presented in Fig. 2. It can be noticed
that the maxima are in the region of the most powerful thermal power plants,
"Maritsa-Iztok". A secondary maximum is observed over the region of Sofia.
The impact of Bulgarian sources on SOx concentrations over Greece is relatively
high. The concentration levels are about an order of magnitude less than the
maxima over Northern Greece and 1.5-2 orders of magnitude less than the
maximum over the other part of the country.

Fig. 2. Concentrations in air [μg S m⁻³] and concentrations in precipitation [mg S l⁻¹]
for 1995 due to Bulgarian sulfur sources.

Table 3. Monthly distribution of the Bulgarian impact in sulfur pollution of SE Europe
for 1995. Total deposition in [kt], Bulgarian total emission: 748.631 ktS/year.

        JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT    NOV    DEC    Year    emit [%]
BG      23.7   12.4   20.5   17.0   13.5   19.3   16.1   16.4   12.3   6.7    18.9   23.8   200.7   26.8
GR      4.0    1.6    3.1    2.2    2.1    0.9    3.9    1.6    0.6    4.1    2.3    1.9    28.3    3.8
Total   50.2   30.0   40.1   34.5   25.3   31.2   25.0   26.6   29.5   20.7   43.1   50.8   406.8   54.3

The distribution of these loads between the different territories is displayed
in Table 3, where the month-by-month variations can be seen, too. The last
row and column in the table show the percentage of deposited quantities relative
to the yearly quantity emitted by Bulgaria, which is estimated at 750 kt for 1995.
It can be seen that about 27% of the emitted sulfur oxides are deposited
over Bulgaria itself; another 27% are deposited in the neighborhood; the rest goes
out of the model region. Greece receives less than 4 percent of the sulfur
pollution produced in Bulgaria, which is estimated at 28 kt as sulfur. This quantity is
deposited mainly over Northern Greece and on the neighboring sea regions.
As to the annual variation of these loads, a not very pronounced maximum can be
noticed in the winter, with a minimum in summer-autumn time.
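The percentages in the last column of Table 3 can be reproduced by dividing the yearly deposition by the Bulgarian total emission given in the table caption. The sketch below does this for the three rows and also checks that the monthly values for Bulgaria add up to the yearly total; the variable and function names are illustrative.

```python
BG_EMISSION_KT = 748.631  # Bulgarian total emission for 1995 (Table 3 caption)

# Monthly deposition over Bulgaria and yearly totals, copied from Table 3 (kt).
monthly_bg = [23.7, 12.4, 20.5, 17.0, 13.5, 19.3,
              16.1, 16.4, 12.3, 6.7, 18.9, 23.8]
yearly = {"BG": 200.7, "GR": 28.3, "Total": 406.8}


def percent_of_emission(deposited_kt, emitted_kt):
    """Deposited quantity as a percentage of the emitted quantity."""
    return 100.0 * deposited_kt / emitted_kt


shares = {k: round(percent_of_emission(v, BG_EMISSION_KT), 1)
          for k, v in yearly.items()}
print(shares)
print(round(sum(monthly_bg), 1))  # differs from 200.7 only by rounding of the monthly values
```

The recomputed shares match the tabulated 26.8%, 3.8% and 54.3%, so slightly more than half of the emitted sulfur is deposited inside the model region.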

The spatial distribution of the annual sulfur concentrations in air and in
precipitation due to Greek sources for 1995 is presented in Fig. 3. It can be noticed
that only 2% of the sulfur compounds emitted by Greece are deposited over Bulgaria,
a quantity estimated at 6.2 kt (Table 4).

Fig. 3. Concentrations in air [μg S m⁻³] and concentrations in precipitation [mg S l⁻¹]
for 1995 due to Greek sulfur sources.

Table 4. Monthly distribution of the Greek impact in sulfur pollution of SE Europe
for 1995. Total deposition in [kt], Greek total emission: 304.672 ktS/year.

        JAN    FEB    MAR    APR    MAY    JUN    JUL    AUG    SEP    OCT    NOV    DEC    Year    emit [%]
BG      0.7    0.6    0.6    0.4    0.6    0.8    0.1    0.4    0.4    0      0.8    0.9    6.2     2.1
GR      5.8    2.9    5.2    4.6    3.8    3.1    4.2    4.5    3.9    3.2    5.4    6.7    53.3    17.6
Total   16.9   10.2   14.0   10.5   8.9    6.9    7.2    9.2    9.8    6.2    14.4   17.5   131.7   43.2



A comparison with the 10-year report of EMEP/MSC-W shows that the
exchange of sulfur pollution between the two countries is estimated correctly in
order of magnitude, while the present study gives much more detail on the time and
space distribution of the deposited quantities.

It is well known that sulfur pollution is the most severe environmental problem
for the region, and so the studies were concentrated mostly on it. Nevertheless,
calculations of the transport of nitrogen compounds over the Balkan peninsula
from both Greek and Bulgarian sources for two months of 1995 were also carried
out, and the country-to-country pollution budget diagrams were built [5]. An
example of some of the obtained pollution characteristics is given in Fig. 4.

It is easier to comment on the integral characteristics [5]. The estimations
showed that in both months the sulfur and nitrogen air pollution of
Bulgaria is mostly due to domestic sources, especially in July.

The estimations for July should be especially underlined. The fact that the
pollution of Bulgaria is due practically entirely to its own sources can be explained
only by the specific synoptic conditions in the region during the time period,
namely steady E-NE flows during almost the whole of July 1995.

The results should be considered preliminary. They still have to be
confirmed by comparisons with air pollution measurement data. Qualitatively,
however, they agree with the sulfur exchange studies quoted above, which
is evidence that the simplified pollution transport model with a more complex
chemistry used in the study may be suitable for further, more extensive
studies of pollution transport over the Balkan Peninsula.



3.3 Cases of Larger Mean SO2 Surface Concentrations over Bulgaria or Northern Greece

Fig. 4. Dry deposition of Sulfur [μg S m⁻²], Nitrogen [μg N m⁻²] and Ammonia
[μg N m⁻²] for January (a.) and July (b.) 1995.

Ten days with larger mean SO2 surface concentrations over Bulgaria and Northern
Greece were chosen and, by using the method of the functions of influence
(see Figs. 5 and 6, where the time-integrated functions of influence for some of the
episodes are shown), the contribution of the different sources in both
countries to these cases of extreme pollution was further specified [3]. It
was found (see Table 5) that extremely large mean surface concentrations in
Bulgaria may occur under different synoptic conditions and dominating flows,
while the larger mean surface SO2 concentrations in Northern Greece are as a
rule connected with situations of steady N-NE flows above Southern Bulgaria
and Northern Greece. According to the evaluations performed, more than 50%
of the extreme surface SO2 pollution in Northern Greece (up to 82% in some cases)
originates from Bulgarian sources, with a particularly large contribution from the
"Maritza" power plants (in one of the episodes nearly 80%). There are cases,
however, when the SO2 pollution in Northern Greece is formed mostly (more
than 60%) from Greek sources. In all the cases studied, more than 70% of the
extreme SO2 pollution in Bulgaria is caused by Bulgarian sources.



3.4 Mesoscale Effects 

It is well known that, due to the complex topography of the region, mesoscale
processes significantly affect not only the detailed pollution pattern but also
larger-scale pollution characteristics in the Balkan Peninsula. That is why some
preliminary studies of possible mesoscale effects on the pollution exchange
between Bulgaria and Northern Greece were also carried out [4]. It was found that
the mesoscale wind field disturbances are well organized and reflect some typical






Christos Zerefos et al. 




Fig. 5. Horizontal cross-sections of the time integrated functions of influence [s] at
z = 200 m and the contribution [%] of the sources in the corresponding EMEP cell to
the 24 h mean, spatially averaged surface SO2 concentration concerning some episodes
with large surface concentration in Bulgaria for 1995.





Fig. 6. Horizontal cross-sections of the time integrated functions of influence [s] at
z = 200 m and the contribution [%] of the sources in the corresponding EMEP cell to
the 24 h mean, spatially averaged surface SO2 concentration concerning some episodes
with large surface concentration in Northern Greece for 1995.

Table 5. Contribution of Bulgarian (Bg) and Northern Greece (NGr) sources to the mean
surface concentration in the days with larger concentrations in Bulgaria or Northern Greece.

Days with larger mean surface               Days with larger mean surface
concentration for Bulgaria                  concentration for Northern Greece

date    G     Bg sources  NGr sources       date    G     Bg sources  NGr sources
              [%]         [%]                             [%]         [%]
28.03.  6.67  90.87       8.71              15.04.  7.04  38.24       32.67
03.11.  6.30  79.69       5.75              23.10.  6.83  82.04       17.85
07.12.  5.67  98.39       1.13              24.10.  4.91  80.19       19.80
01.12.  5.13  94.76       0.75              29.05.  4.55  71.41       25.73
21.12.  4.81  95.68       2.41              08.10.  4.34  80.38       19.59
26.04.  4.48  77.61       5.54              13.12.  4.29  42.32       30.52
09.11.  4.47  97.38       2.62              27.10.  4.01  57.61       41.43
28.12.  4.31  90.56       7.87              17.10.  3.54  73.17       26.52
05.12.  4.13  85.19       0.13              21.11.  3.17  73.44       26.56
23.11.  3.96  96.98       3.02              22.11.  3.06  65.34       36.65


effects of the interaction of the air flows with the heterogeneous underlying
surface: sea breeze effects, blocking by the mountains, formation of local-scale
vortices, and even tendencies of air flow channeling along the Danube river or the
Bosphorus and the Dardanelles. The comparison of the orders of magnitude of the wind
velocity and the corresponding differences shows that the mesoscale effects are
well displayed not only qualitatively but also quantitatively. The mesoscale effects
on the air pollution are also significant. Even for larger domains, such as the whole
territory of Bulgaria or Northern Greece, the mesoscale corrections can be quite
large: up to 70% for some of the integral pollution characteristics.

4 Conclusions 

The results reported above demonstrate that the tasks of the joint research
project were, in general, successfully accomplished. They can be directly used
in decision making, negotiations and the development of contamination strategies,
and so provide a scientific basis for a joint Bulgarian-Greek policy for the
reduction of air pollution in the region. Still, many questions about the pollution
exchange between the countries remain unanswered, and new ideas arose in the
course of the current joint studies, which is a good motivation for
continuing the collaboration between the Greek and Bulgarian teams.

Abstract. Ozone pollution may cause damage to plants, animals
and human beings when certain critical levels are exceeded. Therefore, it
is important to study the actual ozone levels and the relationship between
related emissions and high ozone concentrations.

Two versions of the Danish Eulerian Model (DEM) are used in this
study. The fine-grid version of DEM uses a grid of 480 x 480 points (10
km resolution). The coarse-grid version of DEM is defined on a 96 x
96 grid (50 km resolution). In the paper, an attempt is made to answer
the following three questions: (a) Where in Bulgaria and in Europe are the
highest ozone concentrations located? (b) How large is
the influence of the European emission sources on the pollution levels in
the different parts of Bulgaria? (c) Is it possible to evaluate the changes
of the pollution levels in Bulgaria and in Europe when the European emissions
predicted for 2010 are used?

1 Introduction 

High ozone concentrations can cause damage to plants, animals and human
health. In fact, when the effects of high ozone levels are studied, one should
look not at the ozone concentrations themselves but at some related quantities.
The following four quantities (high ozone level indicators) are important [1-3,5]:

- AOT40C: the accumulated exceedance over the threshold of 40 ppb of the hourly
mean O3 concentrations during the daytime hours in the period from May 1 to
July 31. Crops are damaged when AOT40C exceeds 3000 ppb hours.

- AOT40F: the same as AOT40C, but accumulated during the period from
April 1 to September 30. Forests are damaged when AOT40F exceeds 10000
ppb hours.

- NOD60: the number of days on which the ozone concentration averaged over eight
successive hours exceeds the critical value of 60 ppb at least once. If the
limit of 60 ppb is exceeded in at least one 8-hour period during a given
day, then the day is called “bad”. People with asthmatic diseases have
difficulties on “bad” days. It is desirable that the number of “bad” days does
not exceed 20 per year. It turns out that it is difficult to satisfy even this
relaxed requirement. Removing all “bad” days is too ambitious a task.

- ADOM: the averaged daily maxima of the ozone concentrations in the period
from April 1 to September 30. This quantity is not directly related to any
particular damaging effect. It is used to validate the model results. The
model validation related to the other quantities is rather difficult, as will
be seen later.
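
As a rough illustration of how the indicators defined above could be evaluated from hourly ozone data, a minimal Python sketch follows. It is not the DEM post-processing code: the daytime window (08:00 to 20:00), the restriction of the 8-hour windows to a single calendar day, and all function names are assumptions made for this example only.

```python
from typing import Sequence

# Assumption: "day-time" is taken as the hours 08:00-20:00 (not specified in the text).
DAYTIME_HOURS = range(8, 20)


def aot40(days: Sequence[Sequence[float]], threshold: float = 40.0) -> float:
    """Accumulated exceedance over `threshold` (in ppb hours) during daytime hours."""
    return sum(
        max(day[hour] - threshold, 0.0) for day in days for hour in DAYTIME_HOURS
    )


def is_bad_day(day: Sequence[float], limit: float = 60.0, window: int = 8) -> bool:
    """True if at least one average over `window` successive hours exceeds `limit`.

    Assumption: only windows lying entirely within the given day are checked.
    """
    return any(
        sum(day[start:start + window]) / window > limit
        for start in range(len(day) - window + 1)
    )


def nod60(days: Sequence[Sequence[float]]) -> int:
    """Number of "bad" days in the sense of the NOD60 indicator."""
    return sum(1 for day in days if is_bad_day(day))


def adom(days: Sequence[Sequence[float]]) -> float:
    """Average of the daily maxima of the hourly concentrations."""
    return sum(max(day) for day in days) / len(days)


# Two synthetic days of 24 hourly values (ppb): one clean, one with a 10-hour episode.
clean_day = [30.0] * 24
polluted_day = [30.0] * 8 + [70.0] * 10 + [30.0] * 6
season = [clean_day, polluted_day]

print(aot40(season), nod60(season), adom(season))
```

The same `aot40` function serves for AOT40C and AOT40F; only the set of days passed in (May to July or April to September) differs.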

Three emission scenarios are run using the Danish Eulerian Model (DEM)
discretized on both the coarse (96 x 96) and the fine (480 x 480) grid [6]:

- In the first scenario, called Basic Scenario, the emissions for 1997 from the
EMEP inventories [4] are used.

- In the second scenario, called Bulgarian Scenario, the emissions for 1997
from the EMEP inventories are used, but the Bulgarian emissions are set to
zero.

- The third scenario, Scenario 2010, is obtained by modifying the EMEP
emissions for 1990 by the factors given in [1].

In all scenarios the meteorology for 1997 is used.

2 Validation of the Model Results 

Comparisons of quantities calculated by DEM that are related to high ozone
concentrations with corresponding measurements taken at many EMEP stations
located in different European countries are presented in this section. Only data
from “representative” stations are used.

When AOT40C and AOT40F values are compared, a station is considered
representative if it provides measurements for at least 50% of the total number
of hours in the respective period.

When NOD60 values are compared, at least 50% of the 8-hour averages are required
to be available. This is a stronger requirement than in the AOT40
cases because if one measurement is missing then eight 8-hour averages cannot
be calculated.

When ADOM values are compared, a station is considered representative if
50% of the daily maxima can be found. To calculate the maximum for a given
day, at least 20 hourly measurements must be available.

Note that the above criteria are different from the criteria used for
representative stations in [6]. There, the daily mean values of the measurements
are tested. In the present study the hourly mean values of the ozone
concentrations are handled.
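
The remark that one missing measurement removes eight 8-hour averages can be checked with a short Python sketch. The availability flags and the helper name `valid_window_starts` are hypothetical and serve only to illustrate the counting argument.

```python
def valid_window_starts(available, window=8):
    """Start indices of all windows in which every hourly value is available."""
    return [
        start
        for start in range(len(available) - window + 1)
        if all(available[start:start + window])
    ]


# 48 consecutive hours with complete data, then the same series with one gap.
hours = [True] * 48
complete = valid_window_starts(hours)

hours[20] = False  # a single missing hourly measurement
with_gap = valid_window_starts(hours)

lost = len(complete) - len(with_gap)
print(len(complete), len(with_gap), lost)
```

Every window that contains the missing hour is lost, which is why the 50% requirement on 8-hour averages is stricter than the same requirement on single hours.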

The results from the comparisons are summarized in Table 1. 

It is clearly seen from the table that the model overestimates the AOT40
indicators. The calculations on the fine grid are somewhat closer to the measured
values, and the correlation coefficients are higher. The increase is substantially
larger for the ADOM indicator. However, the real benefit of the enormous amount of
calculations on the fine grid appears when the spatial distribution of the various
indicators is considered, especially for small areas.

Table 1. Comparison of model results with measurements

Compared  Number of  Computed      Computed        Measured  Correlation   Correlation
quantity  stations   96 x 96 grid  480 x 480 grid  mean      96 x 96 grid  480 x 480 grid
AOT40C    90         9786          9056            7119      0.50          0.52
AOT40F    92         17311         15593           12731     0.51          0.56
NOD60     81         20.27         19.14           20.99     0.56          0.64
ADOM      97         44            49              48        0.54          0.74


The low correlations and their negligible increase when the fine-grid DEM is
used can be explained by two factors: the numerical instability of the
AOT40 and NOD60 determination (both calculated and measured) and the
necessity to have regular measurements. When the ozone concentration is close to
the threshold (40 or 60 ppb), small errors in the calculated or measured values can
lead to considerable errors in determining the respective indicators (small
difference of big numbers). On the other hand, if some of the necessary measured
values are missing, the calculation of AOT40 and the classification “good/bad
day” are rather uncertain. These two facts (the numerical instability of the results
and the necessity to have regular measurements) limit the number of
representative stations and make the validation of the AOT40 and NOD60 values rather
difficult. This explains why the correlation coefficients drop from about
0.60-0.80 when the mean concentrations of the major non-ozone pollutants are
compared [6] to about 0.50-0.56 when high-level ozone indicators are compared.
It also explains why it is considerably easier to validate the results related to
the averaged daily maxima of the ozone concentrations.
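
The sensitivity near the threshold can be made concrete with a small numerical experiment. The following Python sketch uses a synthetic series of concentrations just above 40 ppb; the numbers are invented for illustration and are not taken from the model or from measurements.

```python
def aot40(hourly_ppb, threshold=40.0):
    """Accumulated exceedance over the threshold, in ppb hours."""
    return sum(max(c - threshold, 0.0) for c in hourly_ppb)


# 100 hours at 41 ppb, and the same series with a +1 ppb error in every value.
true_series = [41.0] * 100
biased_series = [c + 1.0 for c in true_series]

relative_input_error = 1.0 / 41.0                      # about 2.4%
aot_true = aot40(true_series)
aot_biased = aot40(biased_series)
relative_output_error = (aot_biased - aot_true) / aot_true

print(relative_input_error, aot_true, aot_biased, relative_output_error)
```

In this constructed case an error of about 2.4% in the concentrations doubles the accumulated value, which is the “small difference of big numbers” effect described above.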

In what follows, only results for the AOT40C calculations are presented.



3 Distributions of AOT40C Values in Bulgaria 
and in Europe 

As stated above, when AOT40C values (accumulated in May-July) exceed 3000
ppb hours the crops are damaged, while AOT40F values (accumulated in
April-September) above 10000 ppb hours indicate damage to forests.
Here, only the results related to AOT40C are presented (the corresponding
AOT40F fields are quite similar). In fact, the scaled AOT40C values
(AOT40C/3000) are discussed. In this way, it is immediately clear in
which areas of Europe or of Bulgaria the critical limit is exceeded and where the
exceedances are greatest. The results are presented in Figs. 1-8.

The conclusions that can be drawn after studying these results are: 

- As can be seen from Figs. 1 and 2, the general pattern of the distribution of
the scaled AOT40C values (Basic Scenario) is very similar for the fine and
coarse grids, the fine grid providing smoother distributions.

- In the highly polluted parts of Europe (Central and Western Europe) the
critical level of 3000 ppb hours is exceeded by a factor of up to seven.

Fig. 1. Scaled AOT40C values for whole Europe and (50 km x 50 km) grid

Fig. 2. Scaled AOT40C values for whole Europe and (10 km x 10 km) grid

Fig. 3. Scaled AOT40C values for surrounding Bulgaria area and (50 km x 50 km) grid

Fig. 4. Scaled AOT40C values for surrounding Bulgaria area and (10 km x 10 km) grid



- When zooming in on smaller regions, the benefit of using the
fine grid is extremely pronounced, as can be seen from Figs. 3 and 4. They
also show that the AOT40C levels in Bulgaria are quite low in comparison
with Western Europe. The critical value is exceeded only in some parts of
Bulgaria. It must be noted that these results refer to 1997; in some other years,
especially at the beginning of the 1990s, the critical limit was exceeded by a
factor greater than two over the whole territory of Bulgaria.

- Fig. 5 shows the contribution of the European sources to the AOT40C levels
in Bulgaria (Bulgarian Scenario vs. Basic Scenario). It varies from 6%
to 91%. The contribution is greater in the western and northern parts of
Bulgaria. On the other hand, it can be noticed that the contribution of
Bulgarian sources to the AOT40C levels in some parts of Northern Greece
and in the European part of Turkey seems to be considerable.

Fig. 5. Contribution from European sources to the Bulgarian AOT40C levels

Fig. 6. Changes of the AOT40C levels for surrounding Bulgaria area when emissions
from Scenario 2010 are used

Fig. 7. Scaled AOT40C values for whole Europe and (10 km x 10 km) grid when
emissions from Scenario 2010 are used

Fig. 8. Changes of the AOT40C levels. Emissions from Scenario 2010



- Fig. 6 presents the changes that the expected emissions for 2010 seem to
cause (Scenario 2010 vs. Basic Scenario). An increase of the AOT40C values
in Bulgaria is seen. This is a common behaviour for most of the countries
in Eastern Europe (see Fig. 7). This behaviour can be explained by the fact
that, due to the economic crisis in these countries, the
emissions were reduced very considerably in the 1990s. The reductions were so
great that the emissions in 1997 became smaller than the values expected for
2010. This is illustrated in Table 2, where the emission tendencies for Bulgaria
and Germany are shown (more details can be found in [4]).

Table 2. Bulgarian and German emission levels. The emissions for 1990 and 1997
are taken from [4]. The expected emissions for 2010 are calculated by multiplying
the 1990 emissions by the factors given in [1].

Emissions in 1000 tonnes per year

                 1990   1997   2010
Bulgaria  NOx     361    225    303
          VOC     217    120    210
Germany   NOx    2709   1846   1192
          VOC    3225   1779   1161



- The run of the model with the emissions expected for 2010 indicates that the
AOT40C values will be reduced considerably in Central and Western Europe
(see Figs. 7 and 8), in some areas by more than 60%. This behaviour becomes
clear when the changes of the emission levels for Germany, given in Table 2, are
analysed. A steady reduction in time is seen there. Similar results can be
found for other countries in Central and Western Europe (see again [4]).
However, it must also be emphasized that the rather large relative reductions
do not lead to safe AOT40C levels everywhere. Indeed, the results in Fig. 7
show that in some parts of Central and Western Europe the AOT40C levels
will still be very high, exceeding the critical level of 3000 ppb hours by a
factor of up to six.

In the previous section it was shown that the AOT40C values are very
sensitive to small numerical errors in the calculation of the ozone concentrations.
This also means that small changes of the ozone concentrations, due to some
regulatory measures, might lead to considerable reductions of the AOT40C values
(the results in Fig. 8 confirm this conclusion). However, one should be very
careful when general conclusions are to be drawn. The big problem is that the
AOT40C values are accumulated over a rather short time period (only three
months). It is quite natural that the ozone concentrations change as the
meteorological conditions change from one year to another. It is not
sufficient to study the AOT40C levels as a function of the emissions only, i.e. to
study these levels by using only emission scenarios, which is often the case (see,
for example, [1]). The problem is much more complicated, and it is necessary to
study the AOT40 levels as a function of at least both the emissions and the
meteorological conditions when regulatory measures are to be defined and used
in practice.

The great influence of the meteorological conditions on the AOT40 values
can be seen in Fig. 9. The AOT40C values calculated by the coarse-grid version
of DEM are compared with measurements. It is seen that both the measured
and the calculated results vary considerably from one year to another. The
emissions in Denmark and in Europe were reduced in the period from 1989 to 1998.
The trend of reductions is practically not seen when the measurements are taken
into consideration. A reduction trend is seen for the calculated AOT40C values.
However, the variations from one year to another are much more clearly seen
than the trend of reduction.

[Figure: two panels, each showing calculated (CALC) and observed (OBS) values for
Ulfborg and Frederiksborg, and values for the whole of Denmark.]

Fig. 9. AOT40C values (AOT40 for crops) in the period from 1989 to 1998. Variations
of the AOT40C values



4 Some Concluding Remarks 

The planned reductions of the emissions in Europe (the emissions from Scenario
2010) are probably not sufficient to reduce the high ozone levels in
Europe to safe limits. On the other hand, the emissions from Scenario 2010 seem
in general to be quite sufficient to reduce the ozone concentrations in Bulgaria
to safe levels.

Common actions, in which all European countries try to find control strategies
for keeping the high pollution levels under prescribed safe limits, seem to be
necessary. Indeed, the effect of using Scenario 2010 for the European emissions
is in many cases more beneficial for the Bulgarian area than the optimal results
which can be achieved by Bulgaria alone (i.e. the results obtained by setting all
Bulgarian emissions to zero).

More complex analysis is needed in order to decide what precisely is wanted
and how to reduce the ozone concentrations to safe levels in an optimal way.
Many scenarios with different meteorological data also have to be run in the
attempt to resolve this task successfully.

Abstract. We investigate collocation methods for the efficient solution
of singular boundary value problems with an essential singularity. We 
give numerical evidence that this approach indeed yields high order so- 
lutions. Moreover, we discuss the issue of a posteriori error estimation for 
the collocation solution. An estimate based on the defect correction prin- 
ciple, which has been successfully applied to problems with a singularity 
of the first kind, is less robust with respect to an essential singularity 
than a classical strategy based on mesh halving. 



1 Introduction 

We consider boundary value problems with an essential singularity (or singularity
of the second kind),

\[ t^{\alpha} z'(t) = f(t, z(t)), \quad t \in (0, 1], \tag{1} \]
\[ g(z(0), z(1)) = 0, \tag{2} \]
\[ z \in C[0, 1], \tag{3} \]

where $\alpha > 1$. Here, $f$ and $g$ are smooth functions of dimension $n$ and $p$,
respectively. In general, $p < n$ holds and condition (3) provides the additional
$n - p$ relations that guarantee the well-posedness of the problem. Analytical
results for problems of this type have been discussed in detail in [8]. For the
numerical treatment, we assume that an isolated solution of (1)-(3) exists.

Boundary value problems with an essential singularity are frequently encoun- 
tered in applications. In particular, problems posed on infinite intervals are often 
transformed to this problem class. Here, we would like to mention a problem in 
foundation engineering discussed in [9], which is of use for the design of oil rigs 
above the ocean floor (see also Example 1 below). Models from fluid dynamics 
are another source for the problems we are interested in: The Blasius equation 
describes the laminar boundary layer on a flat plate, see for example [11]. The 
von Karman swirling flows result from the Navier-Stokes equations for a sta- 
tionary, axisymmetric flow of a viscous incompressible fluid occupying the half 
space over an infinite rotating disk, cf. [10]. Finally, one approach to the classical 
electromagnetic self-interaction problem ([5]) leads to a boundary value problem 
with a singularity of the first kind ($\alpha = 1$ in (1)) at $t = 0$ and a
singularity of the second kind due to the formulation on an infinite interval.

In this paper, we investigate numerical methods which may be successfully
applied to obtain high-order solutions for boundary value problems of the type
(1)-(3). In particular, in §2 we examine the empirical convergence order of
collocation methods at either equidistant or Gaussian points. These methods work
satisfactorily when they are applied to boundary value problems with a
singularity of the first kind, see [3]. Moreover, collocation has been implemented in
the MATLAB code sbvp for the latter class of problems, see [1]. This code also
uses an a posteriori error estimate based on defect correction, which has been
analyzed for singularities of the first kind in [4]. In §3, we show that this
estimate does not work for problems with an essential singularity. A modification of
this idea is not satisfactory either, as demonstrated in §3.2. However, a strategy
based on mesh halving is shown to be a promising candidate for an
asymptotically correct error estimate for the collocation solution of (1)-(3), see §3.3.



2 Convergence of Collocation Methods 



In this section, we discuss polynomial collocation with maximal degree $m \in \mathbb{N}$
for problems on grids
$\Delta := \{t_{i,j} = t_i + \rho_j h,\ i = 0, \ldots, N-1,\ j = 0, \ldots, m\} \cup \{t_N\}$,
where $h := 1/N$, $t_i := ih$ and $0 = \rho_0 < \rho_1 < \cdots < \rho_m < 1$. This means
that we approximate the analytical solution by a continuous collocating function
$p(t) := p_i(t)$, $t \in [t_i, t_{i+1}]$, $i = 0, \ldots, N-1$, where $p_i$ is a polynomial of
maximal degree $m$, which satisfies the differential equation (1) at the collocation
points $t_{i,j}$, $i = 0, \ldots, N-1$, $j = 1, \ldots, m$, and the boundary conditions. In this
setting, a convergence order of $O(h^m)$ can be guaranteed for regular problems
with appropriately smooth data. However, when the collocation points are suitably
chosen (Gaussian points), even a (super-)convergence order $O(h^{2m})$ holds
at the mesh points $t_i$, see [6].
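
To make the grid definition concrete, the following Python sketch builds the points $t_{i,j} = t_i + \rho_j h$ for a small example. It uses exact rational arithmetic and the equidistant choice $\rho_j = j/5$ that is applied to Example 1 later in this section; the function name `collocation_grid` is hypothetical.

```python
from fractions import Fraction


def collocation_grid(n_intervals, rho):
    """Return the grid points t_{i,j} = t_i + rho_j * h, followed by t_N = 1.

    `rho` must start with rho_0 = 0 so that the mesh points t_i are included.
    """
    h = Fraction(1, n_intervals)
    points = [i * h + r * h for i in range(n_intervals) for r in rho]
    points.append(Fraction(1))
    return points


# m = 4 equidistant collocation points per subinterval, plus rho_0 = 0.
rho = [Fraction(j, 5) for j in range(5)]
grid = collocation_grid(2, rho)

print([str(t) for t in grid])
```

For $N$ subintervals the grid contains $N(m+1) + 1$ points, of which $N + 1$ are mesh points and $Nm$ are collocation points.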

For problems with an essential singularity, we observed that the convergence
order is at least $O(h^m)$; for numerical evidence see [2]. We illustrate this
convergence behavior in Table 1. The results were computed for the problem from
foundation engineering specified in [9],

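Example 1

\[ z'(t) = \frac{1}{[\ldots]} \begin{pmatrix} -z_2(t) \\ -z_3(t) \\ [\ldots] \\ [\ldots] \end{pmatrix}, \tag{4} \]

\[ \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} z(0)
 + \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} z(1)
 = \begin{pmatrix} [\ldots] \\ 0 \\ 0 \\ [\ldots] \end{pmatrix}, \tag{5} \]
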



where collocation at four equidistant points $\rho_j = j/5$, $j = 1, \ldots, 4$, was applied.
All computations in this paper were performed using MATLAB 6.1 in IEEE
double precision with relative machine precision EPS $\approx$ 1.11e-16. In Table 1, $h$
denotes the step size, $\delta$ denotes the maximal global error at the grid $\Delta$ (computed
with respect to a reference solution which was determined for $h = 1/320$), and
$p$ is the empirical convergence order determined from the values of $\delta$ for two
consecutive step sizes. The columns marked “(mesh)” give the respective quantities
computed for the error at the mesh points $t_i$, $i = 0, \ldots, N$, only. Note that for our
choice of collocation points, no superconvergence effects are to be expected.

Table 1. Convergence order of collocation at four equidistant points for Example 1

h      δ         p     δ (mesh)  p (mesh)
1/10   2.06e-17        2.06e-17
1/20   4.26e-18  2.27  4.26e-18  2.27
1/40   1.65e-19  4.69  1.65e-19  4.69
1/80   6.22e-21  4.73  6.22e-21  4.73
1/160  3.84e-22  4.02  3.74e-22  4.06

Table 2. Convergence order of collocation at four Gaussian points for Example 2

h      δ         p     δ (mesh)  p (mesh)
1/10   2.12e-09        2.12e-09
1/20   9.07e-11  4.55  9.07e-11  4.55
1/40   3.92e-12  4.53  3.92e-12  4.53
1/80   1.71e-13  4.52  1.71e-13  4.52
1/160  5.99e-15  4.84  5.99e-15  4.84
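
The empirical order for two consecutive step sizes $h$ and $h/2$ is $p = \log_2(\delta_h / \delta_{h/2})$. As a check, the short Python sketch below recomputes the orders from the global errors listed in Table 1; the function name `empirical_orders` is hypothetical, and the input values are the rounded table entries, so small deviations in the last digit are possible.

```python
import math


def empirical_orders(errors):
    """Orders p = log2(err(h) / err(h/2)) for errors on successively halved steps."""
    return [math.log2(coarse / fine) for coarse, fine in zip(errors, errors[1:])]


# Maximal global errors from Table 1 for h = 1/10, 1/20, 1/40, 1/80, 1/160.
table1_errors = [2.06e-17, 4.26e-18, 1.65e-19, 6.22e-21, 3.84e-22]
orders = empirical_orders(table1_errors)

print([round(p, 2) for p in orders])
```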

Collocation at Gaussian points is affected by order reductions as compared 
to the classical superconvergence order. We demonstrate this observation using 
a simple test example. 

Example 2 



\[ z'(t) = \frac{1}{t^{\alpha}}\, z(t) + e^{t} \left( 1 - \frac{1}{t^{\alpha}} \right), \tag{6} \]
\[ z(1) = e, \tag{7} \]

with the exact solution $z(t) = e^t$. We conjecture that in general, the convergence
order for Gaussian points is $m + \gamma$, where $0 < \gamma = \gamma(\alpha) < 1$, and $\gamma$ decreases
with increasing $\alpha$. Table 2 shows the results for $\alpha = 3$ and collocation at four
Gaussian points. Note that the maximal error is attained throughout at mesh
points, which implies that the convergence order at the mesh points is no higher
than the uniform convergence order $p$.

Remark 1. The analysis of the box scheme given in [7] implies that its order of
convergence is $1 + \gamma$, where $0 < \gamma < 1$. Since the box scheme is equivalent to
collocation at Gaussian points with $m = 1$, this is consistent with the above
conjecture.

Fig. 1. Exact global error and the error obtained by sbvp for Example 2



3 A Posteriori Error Estimation 

Here, we discuss several error estimation strategies for the numerical solution 
computed by collocation. First of all, we present the results obtained with our 
MATLAB code sbvp. The graphs in Figure 1 demonstrate that the error estimate 
implemented in sbvp, see §3.1 below, is completely unreliable for Example 2 with 
$\alpha = 3$. The value of the error estimate (denoted by “sbvp”) is of the order of
magnitude of 10®°, while the value of the exact global error (“exact”) is of the 
order of magnitude of the round-off error. 



3.1 Defect Correction Using the Backward Euler Method 

The error estimation routine implemented in sbvp is based on the defect
correction principle and uses the backward Euler method as an auxiliary scheme.
This estimate was introduced in [3] and has been analyzed for problems with
a singularity of the first kind in [4]. The numerical solution $p(t)$ obtained by
collocation is used to define a “neighboring problem” to (1)-(3). The original
and the neighboring problem are solved using the backward Euler method at the
points $t_{i,j}$, $j = 1, \ldots, m$, and $t_{i,m+1} := t_{i+1}$, $i = 0, \ldots, N-1$. This yields the
grid vectors $\xi_{i,j}$ and $\pi_{i,j}$ as the solutions of the respective schemes

\[ \frac{\xi_{i,j} - \xi_{i,j-1}}{t_{i,j} - t_{i,j-1}} = \frac{f(t_{i,j}, \xi_{i,j})}{t_{i,j}^{\alpha}}, \tag{8} \]

\[ \frac{\pi_{i,j} - \pi_{i,j-1}}{t_{i,j} - t_{i,j-1}} = \frac{f(t_{i,j}, \pi_{i,j})}{t_{i,j}^{\alpha}} + d_{i,j}, \tag{9} \]

where $d_{i,j}$ is a defect term defined by

\[ d_{i,j} := \frac{p(t_{i,j}) - p(t_{i,j-1})}{t_{i,j} - t_{i,j-1}}
   - \sum_{k=1}^{m+1} \alpha_{j,k}\, \frac{1}{t_{i,k}^{\alpha}}\, f(t_{i,k}, p(t_{i,k})). \tag{10} \]

Here, the coefficients $\alpha_{j,k}$ are chosen in such a way that the quadrature rules
given by

\[ \frac{1}{t_{i,j} - t_{i,j-1}} \int_{t_{i,j-1}}^{t_{i,j}} \varphi(\tau)\, d\tau \approx \sum_{k=1}^{m+1} \alpha_{j,k}\, \varphi(t_{i,k}) \]

have precision $m + 1$. The quantities $\xi_{i,j} - \pi_{i,j}$ serve as estimates for the global
error of the collocation solution at the grid points, which is $O(h^m)$ in general.
For regular problems and for a certain class of problems with a singularity of
the first kind, this error estimate was shown to satisfy
$\max_{i,j} |(z(t_{i,j}) - p(t_{i,j})) - (\xi_{i,j} - \pi_{i,j})| = O(h^{m+1})$, cf. [3] and [4].

The failure of this error estimate for problems with an essential singularity 
can clearly be attributed to the fact that the backward Euler method does not 
work for this problem class. It is clear from Table 3 that the method applied to 
Example 2 with $\alpha = 3$ is rapidly divergent.



Table 3. Numerical results for the backward Euler method for Example 2

h      δ         p      δ (mesh)  p (mesh)
1/10   7.20e+00         7.20e+00
1/20   5.93e+01  -3.04  5.93e+01  -3.04
1/40   1.73e+05  -11.5  1.73e+05  -11.5
1/80   8.70e+09  -15.6  8.70e+09  -15.6
1/160  5.99e+17  -26.0  5.99e+17  -26.0



The failure of the backward Euler method for Example 2 is apparently due to
the instability of the numerical integration for this terminal value problem. For
the solution of the associated difference scheme, terms of the form $(1 - h/t_i^{\alpha})$
are accumulated. For $\alpha > 1$, these terms are unbounded for $h \to 0$. This possibly
explains why this scheme works perfectly well for the cases where $\alpha = 1$, but
fails for an essential singularity. An additional drawback of the backward Euler
scheme is the condition number of the associated system of linear algebraic
equations, which becomes intolerably large for decreasing step size $h$. In Table 4
the condition number with respect to the maximum norm, “cond”, and its order
of growth, “p cond”, are recorded. The norm of the system matrix “norm” is
$O(h^{-\alpha}) = O(h^{-3})$, while the norm of the inverse “norm inv” increases very
rapidly.
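
The different behavior of the terms $(1 - h/t_i^{\alpha})$ for $\alpha = 1$ and $\alpha > 1$ can be seen directly by evaluating them on the mesh $t_i = ih$. The following Python sketch does only this; it does not implement the backward Euler scheme itself, and the function name `largest_factor` is hypothetical.

```python
def largest_factor(alpha, n_steps):
    """Return max_i |1 - h / t_i**alpha| over t_i = i*h, i = 1..N, with h = 1/N."""
    h = 1.0 / n_steps
    return max(abs(1.0 - h / (i * h) ** alpha) for i in range(1, n_steps + 1))


# Compare the singularity of the first kind (alpha = 1) with alpha = 3.
for n in (10, 20, 40, 80, 160):
    print(n, largest_factor(1, n), largest_factor(3, n))
```

At $t_1 = h$ the term equals $1 - h^{1-\alpha}$, which stays bounded for $\alpha = 1$ and grows like $N^{\alpha - 1}$ for $\alpha > 1$.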

3.2 Defect Correction Based on the Box Scheme 

When comparing the observations for the backward Euler method in §3.1 with
the results for collocation, see §2, we may conjecture that a possible remedy
for the observed instability could be to use symmetric schemes. In Table 5, we
give the condition numbers “cond” for the box scheme, which have the growth
order p cond = $-\alpha$. This is the same as for the norm “norm” of the system
matrices (not displayed). In contrast to the backward Euler scheme, the inverses
are bounded in this case, see “norm inv”.

Table 4. Conditioning of the backward Euler method for Example 2

h      cond      p cond  norm      p norm  norm inv
1/10   6.99e+05          1.00e+03          6.99e+02
1/20   3.28e+09  -12.2   8.00e+03  -3.00   4.09e+05
1/40   1.55e+15  -18.9   6.40e+04  -3.00   2.42e+10
1/80   9.60e+23  -29.2   5.12e+05  -3.00   1.88e+18
1/160  5.10e+37  -45.6   4.10e+06  -3.00   1.25e+31

Table 5. Error of the error estimate based on the box scheme for collocation at four
equidistant points and conditioning of the auxiliary scheme for Example 2

h      δ         δ err     p err  cond      p cond  norm      norm inv
1/10   1.11e-08  5.35e-09         2.28e+03          8.00e+03  2.85e-01
1/20   7.02e-10  2.26e-10  4.47   1.68e+04  -2.88   6.40e+04  2.62e-01
1/40   4.38e-11  1.02e-11  4.55   1.28e+05  -2.93   5.12e+05  2.50e-01
1/80   2.73e-12  4.42e-13  4.53   9.98e+05  -2.96   4.10e+06  2.44e-01
1/160  1.72e-13  1.90e-14  4.54   7.89e+06  -2.98   3.28e+07  2.41e-01

Thus, we propose the following alternative to the error estimate described
in §3.1: Instead of solving the original and the neighboring problems using the
backward Euler method, we use the box scheme to compute the quantities $\xi_{i,j}$
and $\pi_{i,j}$.¹ Unfortunately, the result is not fully satisfactory either. The error of
the error estimate seems to have the order $m + \gamma$, where again $\gamma$ decreases for
growing $\alpha$. We illustrate this observation in Table 5, where again Example 2
with $\alpha = 3$ is discussed. The underlying numerical method is collocation at four
equidistant points, and we consider the error at the mesh points only. The error
of the error estimate “δ err” decreases faster than the error of the basic method
δ (which is $O(h^4)$), but the difference in the asymptotic orders is not sufficiently
large to guarantee a reliable error estimate.



3.3 Error Estimate Based on Mesh Halving 

The negative results for error estimation strategies based on the defect correction 
principle are the motivation to consider a computationally more expensive error 
estimate based on mesh halving. In this approach, we compute the collocation
solution at m equidistant points on a grid $\Delta$ with step size h and denote this
approximation by $p_\Delta(t)$. Subsequently, we choose a second mesh $\Delta_2$ with step
size h/2. On this mesh, we compute the numerical solution based on the same
collocation scheme to obtain the collocating function $p_{\Delta_2}(t)$. Using these two
quantities, we define

For this purpose we use the defect $d_{i,j}$ from (10) for the evaluation of the right-hand
side at the point $(t_{i,j-1} + t_{i,j})/2$.



1 
Table 6. Global error for collocation at four equidistant points and error of the error
estimate based on mesh halving for Example 2

        whole grid   mesh points   whole grid           mesh points
h       δ            δ             δ err      p err     δ err      p err
1/2     8.80e-06     8.80e-06      2.90e-07             1.61e-07
1/4     5.41e-07     4.65e-07      1.16e-08   4.65      1.16e-08   3.80
1/8     3.02e-08     2.77e-08      3.75e-10   4.95      3.75e-10   4.95
1/16    1.82e-09     1.71e-09      1.61e-11   4.54      1.61e-11   4.54
1/32    1.11e-10     1.07e-10      6.93e-13   4.54      6.93e-13   4.54



$$ \mathcal{E}(t) := \frac{2^m}{1 - 2^m}\left(p_{\Delta_2}(t) - p_\Delta(t)\right) \qquad (11) $$

as an error estimate for the approximation $p_\Delta(t)$. Assume that the global error
$\delta(t)$ of the collocation solution can be expressed in terms of the principal error
function $e(t)$,

$$ \delta(t) = e(t)h^m + O(h^{m+1}), \qquad (12) $$

where $e(t)$ is independent of h. Then obviously the quantity $\mathcal{E}(t)$ satisfies
$\mathcal{E}(t) - \delta(t) = O(h^{m+1})$. Consequently, for collocation at an even number of equidistant
points, this error estimate is asymptotically correct. The convergence results for
collocation methods, see §2, suggest that this is a promising approach.
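The mechanics of the estimate (11) can be illustrated on a much simpler setting. The following is a minimal sketch, assuming a regular (non-singular) test problem y' = -y and the classical fourth-order Runge-Kutta method as a stand-in for collocation with m = 4; the function names are hypothetical and the error is taken as numerical minus exact solution.

```python
import math

def rk4_solve(f, t0, y0, h, n_steps):
    """Integrate y' = f(t, y) with classical RK4 (order m = 4); return all mesh values."""
    t, y = t0, y0
    values = [y0]
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
        values.append(y)
    return values

def mesh_halving_estimate(coarse, fine, m):
    """Estimate (numerical - exact) on the coarse mesh from the two solutions, cf. (11)."""
    factor = 2**m / (1 - 2**m)
    # Every second point of the fine mesh coincides with a coarse mesh point.
    return [factor * (fine[2 * i] - coarse[i]) for i in range(len(coarse))]

f = lambda t, y: -y            # regular test problem, exact solution exp(-t)
n, h, m = 10, 0.1, 4
coarse = rk4_solve(f, 0.0, 1.0, h, n)
fine = rk4_solve(f, 0.0, 1.0, h / 2, 2 * n)
estimate = mesh_halving_estimate(coarse, fine, m)
true_error = [coarse[i] - math.exp(-i * h) for i in range(n + 1)]
print(estimate[-1], true_error[-1])
```

For this smooth problem the estimate and the true error at t = 1 agree closely; the point of the section is that this agreement degrades when the higher-order term in (12) is only $O(h^{m+\gamma})$ with $\gamma < 1$.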

However, numerical results given in [2] indicate that the higher order term
in (12) is rather $O(h^{m+\gamma})$ with $\gamma < 1$ in case of an essential singularity. Here, we
only give the numerical results for the simple test problem Example 2, and refer
the reader to [2] for further results. In Table 6, the error of the error estimate
“δ err” is given together with its asymptotic order “p err”, where the underlying
numerical solution is computed by collocation at four equidistant points. We
can see that, similarly as for the box scheme (Table 5), the error of the error
estimate has order $m + \gamma$ with $\gamma \approx 0.5$. However, the absolute quality of the
error estimate is slightly better than for the box scheme. The error of the error
estimate should be compared with the exact global error on the whole grid and
at the mesh points (the two δ columns of Table 6).
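The “p err” columns can be checked directly from the tabulated errors. The sketch below assumes the usual definition of the observed order under mesh halving, p = log2(err(h) / err(h/2)), and uses the whole-grid “δ err” column of Table 6 as input; the function name is hypothetical.

```python
import math

def observed_orders(errors):
    """Observed convergence orders for a sequence of errors on successively halved meshes."""
    return [math.log2(errors[i] / errors[i + 1]) for i in range(len(errors) - 1)]

# Whole-grid "delta err" column of Table 6, h = 1/2, 1/4, ..., 1/32.
err_table6 = [2.90e-07, 1.16e-08, 3.75e-10, 1.61e-11, 6.93e-13]
orders = observed_orders(err_table6)
print([round(p, 2) for p in orders])
```

Up to rounding of the tabulated values, the computed orders reproduce the “p err” entries and settle near 4.5, that is, $m + \gamma$ with m = 4 and $\gamma \approx 0.5$.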
Abstract. A 3D boundary-integral/finite-volume method is presented
for the simulation of drop dynamics in viscous flows in the presence of 
insoluble surfactants. The concentration of surfactant on the interfaces 
is governed by a convection-diffusion equation, which takes into account 
an extra tangential velocity. The spatial derivatives are discretized by 
a finite-volume method with second-order accuracy on an unstructured 
triangular mesh. Either an Euler explicit or Crank-Nicolson scheme is 
used for time integration. The convection-diffusion and Stokes equations 
are coupled via the interfacial velocity and the gradient in surfactant 
concentration. The coupled velocity-surfactant concentration system is
solved in a semi-implicit fashion. Tests and comparisons with an analyt- 
ical solution, as well as with simulations in the 2D axisymmetric case, 
are shown. 



1 Introduction 

The presence of surfactants in multiphase systems has a significant influence
on the shape and the motion of the interfaces. This has motivated an increasing
interest in understanding the effect of surfactants on the interfacial behavior in
recent years.

A large number of investigations have been devoted to the insoluble surfactant
limit, where the flux between the bulk and the interface is negligible. The
mathematical formulation of the problem includes the hydrodynamic equations for the
velocity within the bulk of the fluids and a convection-diffusion equation on
evolving interfaces for the surfactant concentration. These equations are coupled
via the interface velocity and the dependence of the interfacial tension on the
surfactant concentration. From a numerical point of view, the solution of such a
complex model requires a numerically stable method of high accuracy.

A number of numerical simulations of 2D or axisymmetric flows have been
reported, e.g. [2]. As far as we know, the only 3D methods are those presented by
Yon & Pozrikidis, see e.g. [4]. In the present study we develop a similar
boundary-integral/finite-volume method for the simulation of deformable interfaces
in the presence of an insoluble surfactant.

* This work was supported by the Dutch Polymer Institute, grant #161. 



I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 355-362, 2004. 
© Springer- Verlag Berlin Heidelberg 2004 




356 Ivan B. Bazhlekov, Patrick D. Anderson, and Han E.H. Meijer 

2 Convection-Diffusion of Insoluble Surfactant 

In this section we derive the convection-diffusion equation governing the surfac- 
tant distribution on a deformable interface. A finite-volume method and time 
integration scheme are also presented. Their accuracy is tested for different con- 
centration distributions and interfacial velocity on a sphere. Here, a sphere is 
defined as a closed interface with a constant mean curvature k. 



2.1 Convection-Diffusion on Evolving Interfaces 

Besides the interfacial velocity, an important element of the convection-diffusion
process on an evolving interface is the deformation of the interface. Due to
the interface deformation the interfacial area can locally decrease or increase,
leading to a change of the surfactant concentration. To derive the
convection-diffusion equation on a deformable interface (see also [4]), we start from
the conservation law on an element S of the interface:

$$ \frac{D}{Dt}\int_S \Gamma\,ds = D_s\int_C \mathbf{b}\cdot\nabla_s\Gamma\,dl, \qquad (1) $$

where $\Gamma$ is the concentration of the surfactant. The material derivative D/Dt
expresses the rate of change of the surfactant on S that moves with the interface
velocity. The r.h.s. of (1) expresses the diffusive flux across the contour C of S,
according to Fick's law of diffusion. The vector $\mathbf{b}$ is the unit outward normal to
C lying in the plane tangential to S (see figure 1), $D_s$ is the diffusion coefficient
and $\nabla_s$ is the surface gradient:

$$ \nabla_s = (\mathbf{I} - \mathbf{n}\mathbf{n})\cdot\nabla, \qquad (2) $$

where $\mathbf{n}$ is the unit vector normal to the interface S.

Interchanging D/Dt with the integration in the l.h.s. of (1) yields:

$$ \int_S\left(\frac{D\Gamma}{Dt} + \Gamma\,\nabla_s\cdot\mathbf{u}\right)ds = D_s\int_C \mathbf{b}\cdot\nabla_s\Gamma\,dl, \qquad (3) $$

where $\mathbf{u}$ is the hydrodynamic velocity. After converting the contour integral in
the r.h.s. into a surface integral we can write equation (3) in differential form:

$$ \frac{D\Gamma}{Dt} = -\Gamma\,\nabla_s\cdot\mathbf{u} + D_s\nabla_s^2\Gamma. \qquad (4) $$

In a dynamic simulation based on Stokes flow, it is convenient and kinematically
consistent to move the nodal points with the hydrodynamic velocity plus an
arbitrary tangential velocity $\mathbf{w}$ (see [4, 3] and section 3), i.e. $\mathbf{v} = \mathbf{u} + \mathbf{w}$.

The rate of change of the concentration $\Gamma$ following a nodal point is:

$$ \frac{d\Gamma}{dt} = \frac{D\Gamma}{Dt} + \mathbf{w}\cdot\nabla_s\Gamma, $$

which, incorporated in (4), gives:

$$ \frac{d\Gamma}{dt} = \mathbf{w}\cdot\nabla_s\Gamma - \Gamma\,\nabla_s\cdot\mathbf{u} + D_s\nabla_s^2\Gamma. \qquad (5) $$

The hydrodynamic velocity in (5) can be decomposed into normal and tangential
parts, $\mathbf{u} = (\mathbf{u}\cdot\mathbf{n})\mathbf{n} + \mathbf{u}_s$, and using the definition of the mean curvature of the
interface, $k = \frac{1}{2}\nabla_s\cdot\mathbf{n}$, equation (5) now takes the final form:

$$ \frac{d\Gamma}{dt} = (\mathbf{w} + \mathbf{u}_s)\cdot\nabla_s\Gamma - \nabla_s\cdot(\Gamma\mathbf{u}_s) - 2k\Gamma\,\mathbf{u}\cdot\mathbf{n} + D_s\nabla_s^2\Gamma. \qquad (6) $$

This equation governs the convection-diffusion of an insoluble surfactant of
concentration $\Gamma$ on an evolving interface in a local coordinate system that moves
with the nodal velocity $\mathbf{v}$ and is the starting point for the finite-volume method
described in the following section.
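The step from (5) to (6) can be spelled out as follows. This is a sketch, assuming the convention $\nabla_s\cdot\mathbf{n} = 2k$ for the mean curvature and using that the surface gradient of any scalar is tangential, so that $\mathbf{n}\cdot\nabla_s(\mathbf{u}\cdot\mathbf{n}) = 0$:

```latex
\begin{align*}
\nabla_s\cdot\mathbf{u}
  &= \nabla_s\cdot\bigl[(\mathbf{u}\cdot\mathbf{n})\mathbf{n}\bigr] + \nabla_s\cdot\mathbf{u}_s
   = (\mathbf{u}\cdot\mathbf{n})\,\nabla_s\cdot\mathbf{n} + \nabla_s\cdot\mathbf{u}_s
   = 2k\,\mathbf{u}\cdot\mathbf{n} + \nabla_s\cdot\mathbf{u}_s, \\
-\Gamma\,\nabla_s\cdot\mathbf{u}_s
  &= -\nabla_s\cdot(\Gamma\mathbf{u}_s) + \mathbf{u}_s\cdot\nabla_s\Gamma, \\
-\Gamma\,\nabla_s\cdot\mathbf{u}
  &= \mathbf{u}_s\cdot\nabla_s\Gamma - \nabla_s\cdot(\Gamma\mathbf{u}_s) - 2k\Gamma\,\mathbf{u}\cdot\mathbf{n}.
\end{align*}
```

Adding $\mathbf{w}\cdot\nabla_s\Gamma$ and the diffusion term to the last line gives the four terms on the r.h.s. of (6).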




2.2 Finite-Volume Formulation

The finite-volume form of equation (6) is obtained (see for instance [4]) by taking
the surface integral over an element of the interface S:

$$ \int_S \frac{d\Gamma}{dt}\,ds = \int_S (\mathbf{w} + \mathbf{u}_s)\cdot\nabla_s\Gamma\,ds - \int_C \Gamma\,\mathbf{u}_s\cdot\mathbf{b}\,dl - 2\int_S \Gamma k\,\mathbf{u}\cdot\mathbf{n}\,ds + \frac{1}{Pe}\int_C \mathbf{b}\cdot\nabla_s\Gamma\,dl, \qquad (7) $$

where the second and the last term in the r.h.s. are converted into contour
integrals, using the divergence theorem. Equation (7) is written in dimensionless
form, where Pe is the Peclet number, a dimensionless group proportional to
$1/D_s$. In the following two sections, we discuss approximations of the spatial
terms and the time derivative of formulation (7).



2.3 Spatial Discretization 

We consider a discretization of the interface S by surface elements $S_j$ that
correspond to the nodal points $\mathbf{x}_j$. To construct it, the interface is first triangulated
by triangular elements ($\mathbf{x}_j$ are the vertices of the triangles). The elements $S_j$
are then composed of one third of each triangle to which $\mathbf{x}_j$ belongs, see figure 1.
The mean curvature $k(\mathbf{x}_j)$ and the normal vector $\mathbf{n}(\mathbf{x}_j)$ are calculated by means of
a commonly used expression, see for instance [3]:

$$ k(\mathbf{x}_j)\,\mathbf{n}(\mathbf{x}_j)\,\Delta S_j \approx \int_{S_j} k(\mathbf{x})\,\mathbf{n}(\mathbf{x})\,ds = -\frac{1}{2}\int_C \mathbf{b}\,dl. \qquad (8) $$

For $\Gamma$ in each triangle a linear approximation is used, defined by the
corresponding values in the three vertices. Thus, $\nabla_s\Gamma$ is a constant vector inside each
triangle. The spatial terms in (7) are approximated as follows:
Fig. 1. Interface discretization by triangles with vertices ○. The surface element $S_j$ is
defined by the centers of mass of the triangles (□) and element sides (■).



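- The first term:

$$ \int_{S_j} (\mathbf{w} + \mathbf{u}_s)\cdot\nabla_s\Gamma\,ds \approx \frac{1}{3}(\mathbf{w} + \mathbf{u}_s)_j\cdot\sum_{i\in N_j}(\nabla_s\Gamma)_i\,\Delta S_i, \qquad (9) $$

where $\Delta S_i$ is the area of the i-th triangle and $N_j$ is the set containing the
numbers of the triangles to which $\mathbf{x}_j$ belongs.

- The second term:

$$ \int_{S_j} \nabla_s\cdot(\Gamma\mathbf{u}_s)\,ds = \int_{C_j} \Gamma\,\mathbf{u}_s\cdot\mathbf{b}\,dl = \sum_{i\in N_j}\sum_{m=1,2} \mathbf{b}_i^m\cdot\int_{C_i^m} \Gamma\,\mathbf{u}_s\,dl. \qquad (10) $$

The values of the concentration $\Gamma$ and the interface velocity $\mathbf{u}_s$ on $C_i^m$ are taken
from their linear approximation in the triangle i and the integrals in the r.h.s.
of (10) are calculated by means of the trapezoidal rule. It is seen from figure
1 that the vector $\mathbf{b}$ is constant on the segments $C_i^m$, m = 1, 2, and we denote the
corresponding values by $\mathbf{b}_i^m$.

- The third term:

$$ \int_{S_j} \Gamma k\,\mathbf{u}\cdot\mathbf{n}\,ds \approx \Gamma_j k_j(\mathbf{u}_j\cdot\mathbf{n}_j)\,\Delta S_j; \qquad \Delta S_j = \frac{1}{3}\sum_{i\in N_j}\Delta S_i. \qquad (11) $$

- The last term:

$$ \int_{S_j} \nabla_s^2\Gamma\,ds = \int_{C_j} \mathbf{b}\cdot\nabla_s\Gamma\,dl = \sum_{i\in N_j}(\nabla_s\Gamma)_i\cdot\left(\mathbf{b}_i^1 l_i^1 + \mathbf{b}_i^2 l_i^2\right), \qquad (12) $$

where $l_i^m$ is the length of the segment $C_i^m$, m = 1, 2. To evaluate the order of
approximation of the different terms (9-12), the following test is performed: We
consider uniform meshes on a unit sphere S centered at (x = 0, y = 0, z = 0).
The meshes are obtained by triangulating the interface S into N, 320 < N <
18000, triangular elements. The four terms (9-12) are calculated for $\Gamma = x + 1$
and $\mathbf{u}_s = \mathbf{w} = (x^2 - 1, xy, xz)/4$, and then divided by $\Delta S_j$ to obtain the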
Fig. 2. The dependence of the relative error on the number of elements N.

Fig. 3. The dependence of the relative error on the time step.



corresponding mean values in the nodal points $\mathbf{x}_j$. The relative difference in
maximum norm between the approximate values and the corresponding exact
values in the nodal points is shown in figure 2. The velocity $\mathbf{u}$ in the third term is
taken as $\mathbf{u} = \mathbf{n} = (x, y, z)$. Thus, the graph that corresponds to this term, $\Gamma k(\mathbf{u}\cdot\mathbf{n})$,
estimates the approximation of the mean curvature k and the normal vector $\mathbf{n}$.
Because the space step $\Delta x$ is inversely proportional to the square root of the
number of triangles N ($(\Delta x)^2 \sim 1/N$), we can conclude from figure 2 that the
approximations (9-12) are of second-order accuracy.
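The constant per-triangle gradient $(\nabla_s\Gamma)_i$ that enters (9) and (12) follows directly from the linear approximation of $\Gamma$. The following is a minimal sketch, assuming the gradient is sought in the plane of the triangle as a combination of two edge vectors; the function and variable names are hypothetical and not taken from the authors' code.

```python
def sub(a, b):
    """Difference of two 3D points."""
    return [a[k] - b[k] for k in range(3)]

def dot(a, b):
    """Scalar product of two 3D vectors."""
    return sum(a[k] * b[k] for k in range(3))

def surface_gradient(x0, x1, x2, g0, g1, g2):
    """Gradient of the linear interpolant of nodal values g0, g1, g2 on a flat triangle.

    The gradient lies in the triangle plane: grad = c1*e1 + c2*e2, with
    grad . e1 = g1 - g0 and grad . e2 = g2 - g0 (a 2x2 linear system).
    """
    e1, e2 = sub(x1, x0), sub(x2, x0)
    a11, a12, a22 = dot(e1, e1), dot(e1, e2), dot(e2, e2)
    r1, r2 = g1 - g0, g2 - g0
    det = a11 * a22 - a12 * a12
    c1 = (r1 * a22 - r2 * a12) / det
    c2 = (r2 * a11 - r1 * a12) / det
    return [c1 * e1[k] + c2 * e2[k] for k in range(3)]

# Gamma = x + 1 sampled at the vertices of a tilted triangle.
vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
values = [v[0] + 1.0 for v in vertices]
grad = surface_gradient(*vertices, *values)
print(grad)
```

For the tilted triangle the result is the projection of the full gradient (1, 0, 0) onto the triangle plane, which is what the projector in (2) prescribes.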

2.4 Time Integration 

After using the approximations (9-12) the convection-diffusion equation (7) can
be written in the form:

$$ \frac{d\Gamma_j}{dt} = \sum_{i} A_{ji}\Gamma_i, \qquad j = 1, 2, \dots, N, \qquad (13) $$

where the coefficients $A_{ji}$ depend on the coordinates of the nodal points and
the interface velocity. Here $N_j$ contains the numbers of the nodal points that are
directly connected with $\mathbf{x}_j$. For time discretization we use the theta method and
thus (13) reads:

$$ \Gamma_j(t + \Delta t) = \Gamma_j(t) + \Delta t\sum_{i} A_{ji}\left[\theta\,\Gamma_i(t + \Delta t) + (1 - \theta)\,\Gamma_i(t)\right]. \qquad (14) $$

In the present study the Euler explicit, $\theta = 0$, and Crank-Nicolson, $\theta = 1/2$, methods
are considered. The Euler explicit method is very simple to implement; however, it
has only first-order approximation of the time derivative. The Crank-Nicolson
method has second-order accuracy; however, it is implicit, i.e. (14) is a system of
N algebraic equations. An advantage of the system (14) is that the corresponding
matrix is sparse. For the meshes considered here only 6 or 7 elements per row are
non-zero (each node has 5 or 6 neighbours). This makes the use of an iterative
solver very efficient.

We solve (14) for $\Gamma_j(t + \Delta t)$ by means of the Jacobi or Gauss-Seidel method.
Both methods give identical results and converge within about 3 to 6 iterations to
an absolute tolerance of 10“®. 
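One step of the theta method (14) with a Gauss-Seidel inner iteration can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the matrix A is stored densely for brevity (a real code would exploit its sparsity), the ring-shaped diffusion matrix is a hypothetical stand-in for the coefficients $A_{ji}$, and the tolerance is an arbitrary choice.

```python
def theta_step(A, gamma, dt, theta, tol=1e-12, max_iter=200):
    """Advance d(gamma)/dt = A gamma by one step of the theta method, cf. (14)."""
    n = len(gamma)
    # Explicit part: gamma_j + (1 - theta) * dt * sum_i A_ji * gamma_i.
    rhs = [gamma[j] + (1.0 - theta) * dt * sum(A[j][i] * gamma[i] for i in range(n))
           for j in range(n)]
    if theta == 0.0:
        return rhs  # explicit Euler needs no linear solve
    new = list(gamma)  # initial guess: values at the old time level
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(n):
            # Gauss-Seidel sweep: solve row j for new[j] using the latest values.
            off_diag = sum(A[j][i] * new[i] for i in range(n) if i != j)
            value = (rhs[j] + theta * dt * off_diag) / (1.0 - theta * dt * A[j][j])
            max_change = max(max_change, abs(value - new[j]))
            new[j] = value
        if max_change < tol:
            break
    return new

def ring_laplacian(n):
    """Hypothetical test matrix: diffusion on a closed ring of n nodes."""
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        A[j][j] = -2.0
        A[j][(j + 1) % n] = 1.0
        A[j][(j - 1) % n] = 1.0
    return A

A = ring_laplacian(6)
gamma0 = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
explicit = theta_step(A, gamma0, dt=0.1, theta=0.0)
crank_nicolson = theta_step(A, gamma0, dt=0.1, theta=0.5)
print(explicit, crank_nicolson)
```

Because the system matrix is strongly diagonally dominant for small $\Delta t$, the inner iteration converges in a handful of sweeps, in line with the iteration counts reported above.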

In order to verify the order of approximation of the time integration, for $\theta = 0$
and $\theta = 1/2$, the following test is performed. Consider convection-diffusion on
a unit sphere of a surfactant with initial concentration $\Gamma(t = 0) = 1$. The surfactant
is convected on the sphere by the interface velocity $\mathbf{u}_s = \mathbf{w} = (x^2 - 1, xy, xz)/4$,
while the diffusion is moderate, Pe = 10. For each method the results for the
concentration on a mesh of 2000 triangles at time t = 5, obtained by using 
different At, are compared with the one for At = 10“®, and the relative error is 
shown in figure 3. 



2.5 Tests and Comparisons 

The method for solving the convection-diffusion equation, presented in the pre- 
vious section, is tested here for a fixed interface position (unit sphere S centered 
at the origin) and prescribed interface velocity. Tests, similar to these in [4] , are 
performed on a mesh of 2000 triangular elements and At = 10“^. In the first 
test we consider the relaxation of a surfactant with initially non-uniform
concentration, $\Gamma(t = 0) = x + 1$, towards a uniform one. The process of relaxation
is due to the diffusion at Pe = 1 in the absence of convection, $\mathbf{u}_s = \mathbf{w} = 0$.
Figure 4 a) shows a good agreement of the present results for the concentration
profile with the axisymmetric simulation. The second test is designed to check





Fig. 4. Comparison of the present results (2000 triangular elements and At = 10“®) 
with the axisymmetric simulations {Ad = tt/1000 and At = 10“®) at different time 
instances. The markers correspond to the nodal points in the 3D mesh projected in
the axisymmetric plane $(r, \theta)$: a) Relaxation (diffusion); b) Convection-diffusion.
both the diffusion, Pe = 10, and convection at a prescribed interface velocity
$\mathbf{u}_s = \mathbf{w} = (x^2 - 1, xy, xz)/4$. Initially, the surfactant is uniformly distributed,
$\Gamma(t = 0) = 1$. The present results are in excellent agreement with the
axisymmetric results, as can be seen from figure 4 b). The rings in this figure correspond
to results at t = 10, obtained without extra tangential velocity, $\mathbf{w} = 0$. In this
case the nodal points are convected towards the rear part of the sphere, $\theta = 180°$,
leading to a mesh distortion and inaccurate results.



3 Boundary Integral Formulation 

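The solution of the Stokes equation for the interface velocity $\mathbf{u}(\mathbf{x}_0)$, $\mathbf{x}_0 \in S$, is given
by means of the boundary integral formulation:

$$ (\lambda + 1)\,\mathbf{u}(\mathbf{x}_0) = 2\mathbf{u}_\infty - \frac{1}{4\pi}\int_S \mathbf{f}(\mathbf{x})\cdot\mathbf{G}(\mathbf{x}_0, \mathbf{x})\,dS(\mathbf{x}) + \frac{\lambda - 1}{4\pi}\int_S \mathbf{u}(\mathbf{x})\cdot\mathbf{T}(\mathbf{x}_0, \mathbf{x})\cdot\mathbf{n}(\mathbf{x})\,dS(\mathbf{x}). \qquad (15) $$

Details concerning the notations and the solution procedure of (15) are given
elsewhere [1].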

The presence of an insoluble surfactant plays an important role in the
interfacial forces $\mathbf{f}(\mathbf{x})$. The surfactant lowers the interfacial tension $\sigma$, and also leads
to an interfacial tension gradient $\nabla_s\sigma$. The latter causes the so-called Marangoni
stresses, which are tangential to the interface. Thus, the interfacial force $\mathbf{f}(\mathbf{x})$
consists of normal and tangential parts, and both of them depend on $\Gamma$:

./ W = ^ [CT(T)fc(x) • n(x) - Asa{P)] (16) 

where Ca is the capillary number. To close the mathematical model (7) and
(15-16) we need a relation for $\sigma(\Gamma)$. In the present study we use the simplest
linear dependence

$$ \sigma(\Gamma) = (1 - \beta\Gamma)/(1 - \beta). \qquad (17) $$

Finally, the evolution of the interface is given by the kinematic condition

$$ \frac{d\mathbf{x}}{dt} = \mathbf{u}(\mathbf{x}, t) + \mathbf{w}(\mathbf{x}, t), \qquad (18) $$

where $\mathbf{w}$ can be an arbitrary velocity, tangential to the interface. The coupled
system (7) and (15-18) is solved in a semi-implicit fashion, as follows. For a given
interface position $S(\mathbf{x}, t)$ and surfactant concentration $\Gamma(\mathbf{x}, t)$, the interfacial
forces $\mathbf{f}(\mathbf{x})$ are calculated via (16-17), and then the interface velocity via (15).
The velocity $\mathbf{u}(\mathbf{x}, t)$ is subsequently used in the convection-diffusion equation (7)
and the kinematic condition (18) to obtain the surfactant concentration and the
interface position at the next time instance $t + \Delta t$.
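The order of operations in one step of this coupling can be sketched as below. This is a structural illustration only: the four callables stand for the force evaluation, the boundary-integral solve, the choice of tangential velocity and the surfactant update, all of which are hypothetical placeholders here, and the explicit Euler update of the node positions is an assumption, since the time integrator used for (18) is not stated.

```python
def interfacial_tension(gamma, beta):
    """Linear equation of state (17): sigma = (1 - beta * Gamma) / (1 - beta)."""
    return (1.0 - beta * gamma) / (1.0 - beta)

def semi_implicit_step(x, gamma, dt, forces, velocity, tangential, advance_surfactant):
    """One coupled step: forces -> velocity -> surfactant and interface update.

    x      : list of node positions (3-tuples)
    gamma  : list of nodal surfactant concentrations
    The callables are placeholders for (16)-(17), (15), the choice of w, and (7).
    """
    f = forces(x, gamma)                      # interfacial forces from current state
    u = velocity(x, f)                        # interface velocity from the Stokes solve
    w = tangential(x, u)                      # extra tangential nodal velocity
    gamma_new = advance_surfactant(x, gamma, u, w, dt)
    # Kinematic condition (18), advanced here with explicit Euler (an assumption).
    x_new = [tuple(xk + dt * (uk + wk) for xk, uk, wk in zip(xj, uj, wj))
             for xj, uj, wj in zip(x, u, w)]
    return x_new, gamma_new

# Trivial stand-ins, only to exercise the call sequence.
nodes = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
conc = [1.0, 1.0]
x1, g1 = semi_implicit_step(
    nodes, conc, dt=0.1,
    forces=lambda x, g: [interfacial_tension(gj, beta=0.1) for gj in g],
    velocity=lambda x, f: [(1.0, 0.0, 0.0) for _ in x],
    tangential=lambda x, u: [(0.0, 0.0, 0.0) for _ in x],
    advance_surfactant=lambda x, g, u, w, dt: list(g),
)
print(x1, g1)
```

The scheme is semi-implicit in the sense that the velocity is computed from the state at time t and then held fixed while the concentration and the interface are advanced.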
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4 Results 

Two groups of simulations are performed in the present study. First we consider
drop deformation in simple shear flow $\mathbf{u}_\infty = (y, 0, 0)$. Figure 5 shows the steady
drop shape at $\lambda = 1$, Ca = 0.1, $\beta = 0.1$ and Pe = 1. The results in this
case are in good agreement with those of [4], shown in their figure 10a). The




Fig. 5. Steady shape of a drop in simple shear flow. The shade of gray corresponds to
the concentration $\Gamma$.



second group of comparisons is with the axisymmetric simulations of the drop
deformation in axisymmetric extensional flow $\mathbf{u}_\infty = (x, -y/2, -z/2)$, presented
in [2]. The results obtained by the present method are in good agreement with
those of [2] regarding the drop shape as well as the surfactant concentration. The
simulations performed indicate good numerical stability for all values of Pe.

5 Conclusions 

A 3D boundary-integral/finite-volume method is developed for solving the
coupled system of convection-diffusion on evolving interfaces for the distribution
of the surfactant and the Stokes equations for the interface velocity. The tests
performed show second-order accuracy for the time and spatial terms in the
convection-diffusion equation. The comparisons with previous 3D and
axisymmetric simulations indicate the reliability of the presented method.
Abstract. A fully developed square channel flow with a Reynolds number
of Re = 4410 (based on bulk velocity and duct width) has been calculated
using the Large Eddy Simulation (LES) technique with the Smagorinsky
eddy viscosity model. Results for Prandtl's turbulence-driven secondary
motion give a good qualitative picture and are in good quantitative
agreement with values from other authors. Different numerical aspects
have been investigated: the size of the numerical grid, the spatial
discretization scheme for convection, and the time discretization with
first- and second-order implicit schemes. The accuracy of the results as
well as the resources required for all cases studied are compared and
discussed in detail.

1 Introduction 

Continuing developments in numerical methods and in computer hardware allow
resource-intensive turbulence models like Large Eddy Simulation (LES) to
develop from pure research activities toward a reliable engineering technology.
Such a process requires a well-balanced combination of advanced numerical
techniques, such as parallel algorithms and higher-order spatial and temporal
discretization schemes, which besides their better formal accuracy should fulfill
additional requirements, e.g., being non-dissipative and non-dispersive, see [1].

However, when combinations of advanced numerical techniques are used, the
question of how resource-intensive such techniques are becomes important. An
investigation of the resources for LES computations was performed by [4] for
the channel flow between two parallel plates. In their investigation the authors
point out that only few systematic investigations of the numerical aspects of LES
exist. They attribute this to the large computational requirements of LES.

The present study deals with the resources required for LES, using the example
of a square channel flow. This physical problem possesses a quite typical
secondary flow that is sensitive to the numerical modelling; this flow was used as a
test of the accuracy for the different numerical aspects studied.
The present paper first describes details of the numerical method used, which
form the necessary basis for the subsequent analysis of the results. Then the focus
is set on aspects like grid resolution, spatial accuracy of convective schemes,
temporal accuracy of implicit time schemes, the procedure for averaging of the
instantaneous flow parameters and CPU-time requirements for a parallel algorithm
running on a Linux PC-cluster.



2 The LES Model and the Boundary Conditions 

The widely used subgrid model of Smagorinsky has been applied in the present
flow investigation. The Smagorinsky constant has a value of 0.1. The subgrid
length was reduced in the proximity of the channel walls using the Van Driest
damping function [3,1].
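A minimal sketch of such an eddy-viscosity evaluation is given below. It assumes the common exponential form of the Van Driest damping, 1 - exp(-y+/A+), with A+ = 25; neither the exact damping form nor the value of A+ is stated in the text, so both are assumptions, as are the function names. Only the Smagorinsky constant 0.1 is taken from the paper.

```python
import math

def van_driest_damping(y_plus, a_plus=25.0):
    """Assumed damping factor 1 - exp(-y+/A+); it tends to 0 at the wall and to 1 far from it."""
    return 1.0 - math.exp(-y_plus / a_plus)

def smagorinsky_viscosity(strain_rate_norm, delta, y_plus, c_s=0.1, a_plus=25.0):
    """Subgrid eddy viscosity nu_t = (C_s * delta * D)^2 * |S| with wall damping D."""
    length = c_s * delta * van_driest_damping(y_plus, a_plus)
    return length**2 * strain_rate_norm

# Example: same cell size and strain rate, at the wall and far away from it.
print(smagorinsky_viscosity(10.0, 0.005, y_plus=0.0))
print(smagorinsky_viscosity(10.0, 0.005, y_plus=1000.0))
```

The damping removes the subgrid viscosity at the wall, where the undamped Smagorinsky model would otherwise be too dissipative.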



Wall Boundary Conditions. The wall functions of Werner and Wengle [9]
have been implemented and utilized. This approach assumes that the
instantaneous velocity component parallel to the wall is in phase with the
instantaneous wall shear stress. Thus there is no need for averaging in
time, and the value of the wall shear stress is obtained at each time step without
iterations from the local flow conditions near the wall.



Periodic Boundary Conditions. Periodic boundary conditions were used to
transfer the values of the three velocity components and the turbulent viscosity
from the outflow boundary (plane) to the inlet boundary of the computational
domain. Before transferring the values, the algorithm first corrects the
velocities in the outflow plane so that the continuity equation (and therefore
the global mass flow rate) is satisfied. This is done by using a correction factor
which is constant for all points in this plane. The correction is performed after
each SIMPLE iteration. Such an algorithm guarantees the satisfaction of the
continuity equation without the need for additional algorithmic developments such
as those discussed in [1], e.g. adding a pressure-drop term or using a forcing term
in the momentum equation along the channel.
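The uniform correction of the outflow plane can be sketched as follows. This is an illustration under assumptions: constant density, face-normal velocities and face areas given as flat lists, and hypothetical function and variable names; the actual implementation in the code is not described in the text.

```python
def correct_outflow(normal_velocities, face_areas, target_flow_rate):
    """Scale all outflow velocities by one factor so the volume flow rate matches the target."""
    actual = sum(u * a for u, a in zip(normal_velocities, face_areas))
    factor = target_flow_rate / actual
    return [factor * u for u in normal_velocities], factor

# Example: three outflow faces of equal area carrying twice the required flow.
corrected, factor = correct_outflow([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], target_flow_rate=3.0)
print(corrected, factor)
```

Since a single factor is applied to the whole plane, the shape of the instantaneous velocity profile is preserved and only its integral is adjusted.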



3 Details of the Investigation 

3.1 The Physical Problem Studied 

The fully developed air flow in a square channel with dimensions 0.25 m x 0.25 m
was studied. The length of the channel was 0.6 m, or 2.4 times the channel width
(this length was found to be sufficiently large by comparison with a case in
which the channel length was 6.0 m, but further details go beyond the scope of
the present paper). The average velocity through the cross section (bulk velocity)
was 0.2704 m/s, which corresponds to a Reynolds number (based on channel
width) of 4410.
3.2 The Computer Code, the Numerical Grid
and the Parallel Machine

The Mistral/PartFlow-3D code [5,8] was used for the computations. The code 
is based on the finite volume approach, implicit time steps, the SIMPLE algo- 
rithm for velocity-pressure coupling and second-order central-differencing scheme 
(CDS) for convection. No use was made of a multigrid algorithm in the present 
study so that there is additional potential for further increase in efficiency of the 
computations. 

The rectangular grid used is cell-centered and consists of 144 x 48 x 48 =
331 776 points (together with the boundary points the total number becomes
365 000). The grid was equidistant along the length of the channel. In the
cross-section a symmetrical grid refined toward the walls with an aspect ratio of 1.07
was used. The numerical grid was divided into 12 numerical blocks (each one
consisting of 27 648 points). Each block was computed on a single processor. Such
a distribution of the blocks is convenient for the exchange of the periodic boundary
conditions: in this case the inlet and the exit planes of the computational
domain each belong to a single block and therefore only two processors need to
exchange the information.

Investigations presented in this paper have been performed on subclusters of 
12 processors of the Chemnitz Linux Cluster CLIC (528 Intel/Pentium III, 800 
MHz, 512 MB RAM per node, 2 x FastEthernet), see [10]. The calculations on 
the PC clusters were performed with the MPI distribution of LAM-MPI 6.3.5. 

The CPU-time for the investigation (93 000 time steps) was 122 hours. The 
parallel efficiency achieved was 0.78. On average, 3 iterations of the SIMPLE 
algorithm within a time step were performed. 

3.3 Time Steps, Averaging Procedure and Temporal Discretization 

Initially 3 000 successively decreasing time steps were performed in order to
allow the channel flow to reach a fully developed state. The time step reached
after the initial iterations was 0.01 s real (physical) time; it was kept constant
during the rest of the computations. Thus the CFL number, which defines the
relation between the temporal and spatial discretization accuracy, was equal to
0.8. This value is similar to the one usually used with explicit time methods, see
e.g. [7].

The averaging process was started after the initial iterations and all mean 
and turbulent characteristics of the flow have been obtained after averaging over 
90 000 time steps (this is similar to the procedure of [2] where 100 000 time steps 
are used). 

With the time steps described above the total physical time for averaging
was 900 s, and during this time the flow advances 973 channel widths. The averaging
in the present investigation was done only with respect to time and no use was
made of the homogeneous spatial direction along the channel.
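The figures quoted above can be reproduced from the data given in Section 3. The sketch below recomputes the averaging time and the number of channel widths travelled; it also evaluates a CFL number based on the bulk velocity and the streamwise cell size, which is an assumption, since the text does not state which velocity and cell size enter the reported value of 0.8.

```python
bulk_velocity = 0.2704     # m/s, from Section 3.1
channel_width = 0.25       # m
channel_length = 0.6       # m
cells_streamwise = 144     # equidistant grid along the channel
dt = 0.01                  # s, constant time step
averaging_steps = 90_000

averaging_time = averaging_steps * dt                       # physical time for averaging, s
widths_travelled = bulk_velocity * averaging_time / channel_width
dx = channel_length / cells_streamwise                      # streamwise cell size, m
cfl_bulk = bulk_velocity * dt / dx                          # CFL number based on bulk velocity

print(averaging_time, widths_travelled, cfl_bulk)
```

The bulk-velocity estimate is somewhat below 0.8, so the reported CFL number presumably refers to a larger local velocity or a different reference length.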

A second order accurate implicit time scheme was used in the investigation. 
For the case of uniform time steps the scheme is described by the following 
equation: 
$$ \frac{\partial\phi}{\partial t} \approx \frac{1.5\,\phi^{n+1} - 2\,\phi^{n} + 0.5\,\phi^{n-1}}{\Delta t}, \qquad (1) $$

where the superscripts denote the time levels. A first-order accurate implicit
Euler-backward time scheme is also available in the code; it was used for
comparison with the scheme from equation (1), see the next section.
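The difference in formal accuracy between the two time schemes can be illustrated on a simple function of time. The sketch assumes the standard three-level second-order backward difference with uniform time steps and compares it with the backward Euler difference; the function names are hypothetical.

```python
def backward_euler_derivative(phi_new, phi_old, dt):
    """First-order backward difference for d(phi)/dt at the new time level."""
    return (phi_new - phi_old) / dt

def three_level_derivative(phi_new, phi_old, phi_older, dt):
    """Second-order three-level backward difference (uniform time steps)."""
    return (1.5 * phi_new - 2.0 * phi_old + 0.5 * phi_older) / dt

# Test function phi(t) = t^2 at t = 1 with dt = 0.1; the exact derivative is 2.
phi = lambda t: t * t
t, dt = 1.0, 0.1
first_order = backward_euler_derivative(phi(t), phi(t - dt), dt)
second_order = three_level_derivative(phi(t), phi(t - dt), phi(t - 2 * dt), dt)
print(first_order, second_order)
```

For a quadratic function the three-level formula is exact, while the backward Euler difference carries an error proportional to the time step.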

4 Numerical Results and Analysis 

First, an investigation was made with the parameters set as described in the
previous section; we refer to it below as case “standard”. Figure 1 (a) shows
the secondary flows of the time-averaged flow in a cross-section of the channel
for this case. Such flow patterns, called “Prandtl's second kind of secondary
motion”, occur only in turbulent flows in ducts with non-circular cross-sections
and are turbulence-induced, see Breuer and Rodi [2]. The maximum secondary
velocity appears at the diagonal bisector (the upper right corner in the figure)
and its magnitude is 1.50% of the bulk velocity. This value is somewhat
smaller than the value of 2% reported in [2] and the value of 1.9% obtained by
DNS in [6].



(a) (b) 




Fig. 1. (a) Secondary flows of the time-averaged flow field in a cross-section of the 

channel; (b) Instantaneous velocity vectors in a cross-section of the channel 



The secondary motion is much smaller than the turbulent velocity fluctuations
in the channel and consequently it can be “detected” only after averaging
over a sufficiently large number of time steps. In order to illustrate this, the
instantaneous velocity vectors in a cross section are plotted in Figure 1 (b). As
can be seen in the figure, no flows toward the corners are visible for this
particular time step. The magnitude of the plotted vectors (calculated from the
velocity components which lie in the plane of the figure) is 14% of the bulk
velocity, i.e. an order of magnitude higher than the averaged secondary motion.
Despite the fact that the secondary motion is quite small, it considerably
influences the mean flow field, as can be seen in Figure 2 (a). The isocontours
of the mean streamwise velocity show a clear deviation toward the corners of
the channel. Besides this, a slight violation of the symmetry is seen in the figure.
This is most likely due to the averaging process: the averaging time might
still not be sufficient to achieve perfect symmetry even after 90 000 time steps.
In order to clarify the effect of averaging on the symmetry of the secondary flow
motion, the picture resulting after averaging over 20 000 time steps is shown in
Figure 3 (a).



(a) (b) 





Fig. 2. (a) Isocontours of the mean streamwise velocity in a cross-section of the chan- 
nel; (b) Secondary flows resulting on a two times coarser along y and z coordinates 
grid 



In order to clarify the influence of the space discretization scheme for
convection, a separate run was made with a first-order upwind discretization scheme.
The result: no secondary motion was observed at all. The upwind scheme even
damped out the turbulent motion after approx. 4000 iterations and the
streamwise velocity then exhibits a “laminar” velocity profile (with a maximum of
1.78 times the bulk velocity in the middle of the channel).

The influence of the time discretization scheme was also investigated. A
first-order accurate implicit Euler-backward time scheme was tested for comparison.
The time step for this numerical test was set equal to the time step of
the second-order scheme (0.01 s). Again, as in the case of the upwind spatial
discretization scheme, no secondary motion was observed at all, together with a
“laminarisation” of the flow. A second solution with the first-order time scheme
was then obtained with a 10 times smaller time step, i.e. 0.001 s real time.
We refer to this case as “time_1st_ord”. The result from this solution (again
averaging over 90 000 time steps) is shown in Figure 3 (b). The maximum
magnitude of the secondary motion appears in the vertical wall bisector (vertical plane
of symmetry) and is 4.3% of the bulk velocity. The magnitude of the secondary

Fig. 3. (a) Secondary flows averaged over 20 000 time-steps only (compare with Fig. 1 
(a)); (b) Secondary flows with first-order time-scheme and a 10 times smaller time-step 



motion in the upper right corner is 2.4% of the bulk velocity. However, as can
be seen in the figure, the flow patterns still deviate from symmetry, which
means that the averaging time in this case was not sufficient.

The influence of the grid density was also studied. Figure 2 (b) shows the
results regarding the secondary motion on a two times coarser numerical grid
in the cross section, consisting of 144 x 24 x 24 = 82 944 numerical points.
This case is referred to as “min_cvs”. Observation of the instantaneous values at the
monitoring point during the computations shows that real turbulent oscillations of
all calculated quantities were present only during about 50% of the time of
the computations, alternating with periods of “laminarisation”.
Consequently, after averaging, the maximum magnitude of the secondary motion
is quite low, only 0.64% of the bulk velocity. Good symmetry is still not reached
although both the time step and the number of iterations for averaging were
the same as for the regular (finer) grid.



Table 1. Comparison of the calculation time for the different cases

case studied   SIMPLE iterations         calculation   relative time         parallel efficiency
               per time-step             time          compared to case      of the calculations
               [approx. average value]   [hours]       “standard” [%]        [-]
standard       3                         121.9         100                   0.777
min_cvs        2                         26.7          22                    0.812
time_1st_ord   2                         107.4         88                    0.779



Table 1 shows a comparison of the total time for the calculations (CPU-time 
+ communication-time) for the investigated cases. As expected, the 4 times 
smaller number of control volumes in case “min_cvs” requires approximately 4 
times less calculation time. The 10 times smaller time step for case 
“time_lst_ord” leads to a smaller number of iterations per time-step and to a 
decrease of the calculation time. However, one should keep in mind that for this 
case the physical time for averaging was 10 times smaller, and, as shown above, 
the results are much less accurate, indicating the need for a greater number of 
iterations with a possible further decrease in the chosen time-step. 
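
The relative times quoted in Table 1 can be re-derived from the calculation times in hours. The following is only a small arithmetic sketch: the dictionary and the function names are illustrative and not part of the original study.

```python
# Calculation times in hours, copied from Table 1.
HOURS = {"standard": 121.9, "min_cvs": 26.7, "time_lst_ord": 107.4}


def relative_time(case, reference="standard"):
    """Calculation time of `case` in percent of the reference case."""
    return 100.0 * HOURS[case] / HOURS[reference]


def speedup(case, reference="standard"):
    """How many times faster `case` ran compared to the reference case."""
    return HOURS[reference] / HOURS[case]


for case in HOURS:
    print(f"{case:>13}: {relative_time(case):6.1f} %   speed-up {speedup(case):.2f}")
```

Rounded to integers, the computed percentages agree with the 22% and 88% listed in the table; the coarse-grid case comes out slightly better than the factor of 4 suggested by the number of control volumes alone.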

5 Conclusion 

Many numerical aspects of LES of the fully developed square channel flow have 
been investigated with an implicit time-marching code. The accuracy of all aspects 
has been compared with respect to the obtained secondary flow motion. The following 
main results are obtained: 

— Accuracy of the numerical scheme for convection. The second-order 
accurate Central Differencing Scheme showed good agreement for the magnitude 
of the secondary fluid motion with values from other authors. When the 
first-order Upwind Differencing Scheme (UDS) was used, no secondary flow 
was obtained, together with a full damping of the turbulent oscillations; 

— Accuracy of the time discretization scheme. Second order time-scheme 
delivered good results with a time step of 0.01 s. Such a time step is also 
suitable for explicit time marching, since the corresponding CFL number is 0.8. 
The first-order accurate time scheme with the same time step has been found to 
behave as poorly and nonphysically as the UDS. Even when a 10 times smaller 
time step was used, the first-order implicit backward Euler time scheme delivered 
less accurate results than the second-order scheme; 

— Accuracy of different grid resolutions. Results with a numerical grid 
which was two times coarser along the two axes lying in the cross section 
of the channel showed less accuracy together with a damping of the 
turbulent oscillations during approx. 50% of the time. However, secondary 
motion was still obtained with this grid and the calculation time was 
reduced to 22%, which means that investigations with only 82944 control 
volumes might be used for quick initial testing of LES; 

— Accuracy of different time for averaging. Differently long physical time 
(or, equivalently, different numbers of time steps) have been used to 
obtain the average values of the velocities and turbulent characteristics of the 
flow. Accurate results have been obtained only after a quite long averaging 
process: 90000 time steps, equal to 900 seconds of physical time (during this time 
the fluid passes a distance equal to 973 channel widths). 

The resources required on a parallel Linux PC-cluster are presented and 
discussed in detail for all studied cases. They provide important information 
for the reader interested in planning and carrying out similar numerical studies 
exploring the power of LES. Computations on the order of 4 to 5 days on a 
PC-cluster of 12 computers presently allow LES to be more and more involved 
in industrial flow predictions. 

Abstract. Optimal structural design of microstructured materials is 
one of the central issues of material science. The paper deals with the 
shape optimization of microcellular biomorphic silicon carbide ceramics 
produced from natural wood by biotemplating. Our purpose is to achieve 
an optimal performance of the new composite materials by solving a non- 
linear optimization problem under a set of equality and inequality con- 
straints. The microscopic geometric quantities serve as design parameters 
within the optimization procedure. Adaptive grid-refinement technique 
based on reliable and efficient a posteriori error estimators is applied in 
the microstructure to compute the homogenized elasticity coefficients. 
Some numerical results are included and discussed. 

1 Introduction 

Biological materials exhibit a hierarchically designed composite morphology and 
unique mechanical properties. Naturally grown materials like wood and cellulose 
fibres have recently become of interest for advanced processing of engineering 
materials. Carbon preforms derived from natural wood structures serve as tem- 
plates for preparation of microstructural designed materials. The tracheidal cells 
in wood form directly porous structures at the microlevel which are accessible for 
liquid or gaseous infiltration and chemical reaction processing. Biomorphic mi- 
crocellular silicon carbide (SiC) ceramics have been recently produced by biotem- 
plating methods (see [6]). The production process requires infiltration of liquid 
or gaseous silicon (Si) into the carbonized (at high temperature) wood templates. 
Due to their excellent structural-mechanical properties, the new ceramic composite 
materials have found many technical applications, for instance, filter 
and catalyst carriers, heat insulation structures, medical implants, sensor tools, 
etc. 

In this study, we are concerned with the optimal shape design of the new 
microstructured ceramics by using homogenization modelling and mesh 
adaptivity at the microlevel. The homogenization design method is well established 
in structural mechanics (cf., e.g., [2, 4, 5, 7, 8]) and has recently been successfully 
applied to a variety of optimization problems. Section 2 comments on the 
computation of the homogenized elasticity tensor in the case of a stationary 
microstructure with homogeneous linear elastic constituents. We assume a peri- 
odical distribution of the microstructure with a geometrically simple tracheidal 
periodicity cell. Here, the optimal shape design of the biomorphic SiC ceramics 
is briefly discussed. The optimization is applied to the homogenized model un- 
der both equality and inequality constraints on the state variables and design 
parameters. In Section 3, we focus on the adaptive refinement method based on 
the Zienkiewicz-Zhu (see [11]) error estimator. Mesh-adaptive procedures and a 
posteriori error analysis have been recently widely used in many finite element 
simulations of computational engineering and scientific computing (cf., e.g., [1, 
3,9-11]). Numerical experiments given in the last section show the efficiency of 
our adaptive strategy and the reliability of the a posteriori error indicator. 

2 Optimal Shape Design Based on Homogenization 

We assume a periodical distribution of the microstructure with a simple unit 
tracheidal periodicity cell Y (see Fig. 1) consisting of an outer layer of carbon, 
an interior layer of SiC, a very thin layer of silicon dioxide (SiO2), and a void. 
Experimental data show that the SiC-ceramics are not stable under oxidizing 
conditions and they form a SiO2-layer of thickness 100 nm whereas the typical size 
of the tracheidal cell is approximately 30-50 μm in diameter. The SiO2-coating 
of the interface between the SiC and the void can be done by a controlled oxidation 
process of the inner SiC-surfaces at 800-1200°C in ambient atmosphere and 
can significantly improve the mechanical performance of the ceramic materials. 

To provide a macroscopic scale model of the biomorphic composites we have 
applied the homogenization approach, which has recently become a well-established 
technique in structural mechanics to find an optimal design of microstructured 
materials (cf., e.g., [2, 4, 5, 7, 8]). We consider the case of a stationary 
microstructure with homogeneous isotropic linear elastic constituents and Hooke's 
law as the constitutive equation. Under the assumption of well separated macro- 
and micro-scales, the homogenization method based on double scale asymptotic 
expansion (we refer to [2,5,8] for details) results in the homogenized elasticity 
tensor $E^H = (E^H_{ijkl})$, i,j,k,l = 1,2, with 

\[ E^H_{ijkl} = \frac{1}{|Y|} \int_Y \Big( E_{ijkl}(y) - E_{ijpq}(y) \frac{\partial \xi_p^{kl}}{\partial y_q} \Big)\, dy, \tag{1} \]

where summation over repeated indices is assumed. The vector function $\xi^{kl}$ 
with periodic components $\xi_p^{kl} \in H^1_{per}(Y)$ is considered as a microscopic 
displacement field of the following elasticity cell-problem in a weak formulation 

\[ \int_Y E_{ijpq}(y) \frac{\partial \xi_p^{kl}}{\partial y_q} \frac{\partial \phi_i}{\partial y_j}\, dy = \int_Y E_{ijkl}(y) \frac{\partial \phi_i}{\partial y_j}\, dy, \quad \forall \phi \in V_Y, \tag{2} \]

where $V_Y = \{\phi \in H^1_{per}(Y)\}$ is the set of all admissible Y-periodic displacements. 

Fig. 1. The periodicity cell Y with 3 constituents and a void 

For the macroscale model of the composite ceramic material we consider a 
suitable reference domain $\Omega \subset \mathbb{R}^2$ which can carry given loads. The optimal 
performance of the biomorphic ceramics strongly depends on the exterior body 
force f and the surface traction t applied to a portion $\Gamma_T \subset \partial\Omega$ of the 
workpiece. We denote by $u = (u_1, u_2)^T$ the displacement vector (state variable), by 
$\alpha = (\alpha_1, \ldots, \alpha_m)^T$ the design parameters (the widths of the different material 
layers), and by J the objective functional to be minimized (e.g., mean compliance, 
bending strength, etc.). Three different materials (m = 3) are considered 
in the microstructure shown in Fig. 1. 

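Our optimization problem has the form 

\[ J(u,\alpha) = \inf_{v,\beta} J(v,\beta), \tag{3} \]

subject to the following equality and inequality constraints on the state variables 
and the design parameters: 

\[ \sum_{i,j,k,l=1}^{2} \int_\Omega E^H_{ijkl}(\alpha) \frac{\partial u_k}{\partial x_l} \frac{\partial \phi_i}{\partial x_j}\, dx = \int_\Omega f \cdot \phi\, dx + \int_{\Gamma_T} t \cdot \phi\, ds, \tag{4} \]

\[ g(\alpha) := \sum_{i=1}^{m} \alpha_i = C, \qquad \alpha_{\min} \le \alpha_i \le \alpha_{\max}, \quad 1 \le i \le m, \tag{5} \]

where $\alpha_{\min} = 0$, $\alpha_{\max} = 0.5$, and C, $0 < C \le 0.5$, is a given constant. Note 
that $\alpha_i = 0$, $1 \le i \le m$, corresponds to a complete void, C = 0.5 to a complete 
solid material, and the case $0 < \alpha_i, C < 0.5$, $1 \le i \le m$, corresponds 
to a microstructural porous composite with a void. Equation (4) refers to the 
homogenized equilibrium equation given in a weak form with test functions $\phi$. 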

The constrained minimization problem (3)-(5) is solved by the primal-dual 
Newton interior-point method. The interior-point aspect is taken care of by 
coupling the inequality constraints in (5) by parametrized logarithmic barrier 
functions whereas the primal-dual aspect stems from coupling the equality 
constraints by appropriate Lagrangian multipliers. The resulting saddle-point 
problem is solved by the Newton method applied to the associated Karush-Kuhn-Tucker 
conditions featuring a line-search approach for the increments and a 
hierarchy of merit functions for convergence monitoring (see [7]). 
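
To make the coupling of the constraints in (5) concrete, the following minimal sketch evaluates a logarithmic barrier term for the box constraints and adds a Lagrange multiplier term for the equality constraint. The barrier parameter `mu`, the multiplier `lam` and all function names are assumptions introduced for illustration; the sketch does not reproduce the solver of [7].

```python
import math

ALPHA_MIN, ALPHA_MAX = 0.0, 0.5  # bounds on the design parameters from (5)


def log_barrier(alpha, mu):
    """Barrier -mu * sum(log(a - min) + log(max - a)); +inf outside the open box."""
    total = 0.0
    for a in alpha:
        if not (ALPHA_MIN < a < ALPHA_MAX):
            return math.inf
        total -= mu * (math.log(a - ALPHA_MIN) + math.log(ALPHA_MAX - a))
    return total


def barrier_gradient(alpha, mu):
    """Partial derivatives of the barrier term with respect to each alpha_i."""
    return [-mu * (1.0 / (a - ALPHA_MIN) - 1.0 / (ALPHA_MAX - a)) for a in alpha]


def barrier_lagrangian(j_value, alpha, lam, c, mu):
    """Objective value + barrier term + multiplier term for sum(alpha) = C."""
    return j_value + log_barrier(alpha, mu) + lam * (sum(alpha) - c)


# Three layer widths (m = 3) that satisfy sum(alpha) = C = 0.45.
alpha = [0.10, 0.15, 0.20]
print(log_barrier(alpha, mu=0.1), barrier_gradient(alpha, mu=0.1))
```

The barrier grows without bound as a width approaches 0 or 0.5, which keeps the Newton iterates strictly inside the admissible box while `mu` is driven towards zero.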

3 Adaptive Grid Refinement 

The main idea of the adaptive grid-refinement process is to find a mesh which 
is adjusted to the behavior of the solution, i.e., the mesh points have to be 
concentrated only in those parts of the domain where the solution changes rapidly. 
In the parts where the solution varies only slightly the mesh is kept 
coarse. This process is usually done by a trial and error analysis. Note that 
a priori error estimates give information about the asymptotic error behavior 
and in the case of singularities do not always lead to satisfactory results. In 
the past twenty years, numerous studies have been devoted to error control 
and efficient mesh design based on post-processing procedures (cf., e.g., 
[1,3,9-11]). A natural requirement for an a posteriori error estimate is that it be 
less expensive than the computation of the numerical solution. The a posteriori 
adaptive strategy can be described as follows: 

1. Start with an initial coarse mesh $\mathcal{T}_0$ fitting the domain geometry. Set n := 0. 

2. Compute the discrete solution on $\mathcal{T}_n$. 

3. Use an a posteriori error indicator for each element $T \in \mathcal{T}_n$. 

4. If the global error is small enough, then stop. Otherwise, refine the marked 
elements, construct the next mesh $\mathcal{T}_{n+1}$, set n := n + 1, and return to step 2. 
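
The four steps above can be sketched on a one-dimensional toy problem. Everything in the sketch is an assumption made for illustration: the "discrete solution" is simply the piecewise linear interpolant of a given function, the indicator is the interpolation error at the element midpoint (not the ZZ estimator discussed below), and marked elements are bisected.

```python
import math


def indicators(u, nodes):
    """Step 3: one indicator per element (interpolation error at the midpoint)."""
    eta = []
    for a, b in zip(nodes, nodes[1:]):
        mid = 0.5 * (a + b)
        eta.append(abs(u(mid) - 0.5 * (u(a) + u(b))))
    return eta


def refine(nodes, eta, gamma):
    """Bisect every element whose indicator reaches gamma * max(eta)."""
    threshold = gamma * max(eta)
    new_nodes = [nodes[0]]
    for (a, b), e in zip(zip(nodes, nodes[1:]), eta):
        if e >= threshold:
            new_nodes.append(0.5 * (a + b))
        new_nodes.append(b)
    return new_nodes


def adapt(u, tol=1e-3, gamma=0.5, max_levels=50):
    nodes = [0.0, 0.25, 0.5, 0.75, 1.0]   # step 1: initial coarse mesh
    for _ in range(max_levels):
        eta = indicators(u, nodes)         # steps 2 and 3
        if max(eta) <= tol:                # step 4: stop or refine
            break
        nodes = refine(nodes, eta, gamma)
    return nodes


def layer(x):
    """Test function with a steep interior layer at x = 0.3."""
    return math.atan(50.0 * (x - 0.3))


mesh = adapt(layer)
widths = [b - a for a, b in zip(mesh, mesh[1:])]
print(len(widths), "elements, smallest width", min(widths))
```

In this toy setting the mesh points accumulate around the steep layer while the flat parts of the function keep comparatively coarse elements, which is the behavior the adaptive strategy aims for.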

We solve the linear elasticity equation (2) in the periodicity cell Y by an adaptive 
finite element method based on the Zienkiewicz-Zhu (often called ZZ) error 
estimator proposed in [11]. A detailed theoretical study of the ZZ-estimator for 
linear triangular elements and the Poisson equation can be found in [9]. The 
basic idea of the method consists in computing an improvement of the solution 
stress tensor by post-processing and taking the difference between this so-called 
recovered continuous stress and the discrete solution stress as an error estimator. 
The quality and the reliability of the a posteriori error estimator strongly 
depend on the approximation properties of the stress recovery technique and the 
accuracy of the recovered solution. 

Assume an initial coarse triangulation satisfying the conditions: i) any two 
triangles share either a common edge or a common vertex; ii) the triangulation 
is shape regular, i.e., the ratio of the radius of the smallest circumscribed ball to 
that of the largest contained ball is bounded above by a constant independent 
of the partition. The adaptive mesh-generation technique essentially depends on 
the properties of the coarsest mesh. For more details about dealing with hanging 
nodes within the adaptive procedure and keeping conformity of the elements, we 
refer, for instance, to [10]. 

Suppose that for a given level of refinement n the triangulation $\mathcal{T}_n$ satisfies 
the conditions i)-ii). For fixed k,l = 1,2, denote by $\sigma$, $\hat\sigma$, and $\sigma^*$, respectively, 
the exact stress, the discrete finite element discontinuous stress, and the 
continuous recovered stress in equation (2) discretized on $\mathcal{T}_n$. Originally, the recovered 
stress was defined in [11] by interpolating the discontinuous (over the elements) 
approximation $\hat\sigma$ under the assumption to use the same shape functions as for 
the displacements. This smoothing procedure can be done by the nodal averaging 
method or the $L^2$-projection technique. The components of $\sigma^*$ are piecewise 
linear and continuous. The computation of the global $L^2(Y)$-projection is expensive 
and hence, as proposed in [11], one usually uses “lumping” of the mass matrix 
of the form 

\[ \sigma^*(P) = \sum_{T \in Y_P} \omega_T \, \hat\sigma|_T, \qquad \omega_T := \frac{|T|}{|Y_P|}, \tag{6} \]

where $Y_P \subset Y$ is the union of all elements sharing the vertex P. Thus, $\sigma^*(P)$ is 
simply a weighted average of $\hat\sigma$ on the triangles belonging to $Y_P$. In practice, 
local recovery estimators of the form $\eta_T := \|\sigma^* - \hat\sigma\|_{0,T}$ are considered. The 
global estimator $\eta_Y := \big(\sum_{T \in \mathcal{T}_n} \eta_T^2\big)^{1/2}$ is equivalent to the error $\|\sigma - \hat\sigma\|_{0,Y}$ 
(we refer to [9,11] for more details). Heuristically (as an error indicator), the 
continuous recovered solution $\sigma^*$ is a better approximation to the exact stress $\sigma$ 
than $\hat\sigma$. In our numerical experiments, we apply (6) only to the boundary grid 
points $P \in \partial Y$. For an arbitrary interior node, $\sigma^*$ is computed by averaging the 
stresses at the elements that share the considered node. This procedure requires 
solving a least-squares problem to find an approximation of the stress at the 
corresponding vertex inside the domain Y. 
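
The recovery idea can be illustrated in one space dimension, where the role of the stress is played by the derivative of a piecewise linear function. The sketch below is a simplified analogue under stated assumptions, not the two-dimensional implementation used here: the nodal value of the recovered gradient is the length-weighted average of the gradients on the adjacent elements (the 1D counterpart of the weighted average over $Y_P$), at all nodes including the boundary ones, and the local indicator is the $L^2$ norm of the difference between recovered and discrete gradient. All function names are illustrative.

```python
import math


def element_gradients(nodes, values):
    """Piecewise constant gradient of the P1 function on each element."""
    return [(v1 - v0) / (x1 - x0)
            for x0, x1, v0, v1 in zip(nodes, nodes[1:], values, values[1:])]


def recover_nodal_gradient(nodes, grads):
    """Length-weighted average of the gradients on the elements sharing a node."""
    recovered = []
    last = len(nodes) - 1
    for i in range(len(nodes)):
        num = den = 0.0
        if i > 0:                       # element to the left of node i
            h = nodes[i] - nodes[i - 1]
            num += h * grads[i - 1]
            den += h
        if i < last:                    # element to the right of node i
            h = nodes[i + 1] - nodes[i]
            num += h * grads[i]
            den += h
        recovered.append(num / den)
    return recovered


def zz_indicators(nodes, values):
    """Local indicators: L2 norm of (recovered - discrete) gradient per element."""
    grads = element_gradients(nodes, values)
    rec = recover_nodal_gradient(nodes, grads)
    eta = []
    for k, g in enumerate(grads):
        h = nodes[k + 1] - nodes[k]
        da, db = rec[k] - g, rec[k + 1] - g   # linear difference on the element
        eta.append(math.sqrt(h / 3.0 * (da * da + da * db + db * db)))
    return eta


nodes = [0.0, 0.25, 0.5, 0.75, 1.0]
values = [x * x for x in nodes]               # nodal values of u(x) = x^2
eta = zz_indicators(nodes, values)
print(eta, math.sqrt(sum(e * e for e in eta)))  # local and global estimator
```

For a function that is already linear the discrete gradient is continuous and all indicators vanish; for the quadratic example the recovered gradient at the interior nodes of the uniform mesh coincides with the exact derivative 2x.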

4 Numerical Experiments 

In this section, we present some results on the computation of the homogenized 
elasticity coefficients (1), which involves the numerical solution of (2) with the 
microcell as the computational domain. Due to the equal solutions $\xi^{12} = \xi^{21}$, 
one has to solve three problems in the period Y to find $\xi^{11}$ (Problem 1), 
$\xi^{22}$ (Problem 2), and $\xi^{12}$ (Problem 3). Note that problem (2) is subjected to 
periodic boundary conditions on the outer part of $\partial Y$ and Neumann boundary 
conditions on the inner part of $\partial Y$ where the void is located (see Fig. 1). For 
simplicity, we consider a square hole inside the domain and notice that in this 
case $E^H_{2222} = E^H_{1111}$. 

We use conforming P1 finite element approximations with respect to an 
appropriate initial triangulation of Y and an adaptive mesh-refinement procedure 
based on the ZZ-error indicator. The main purpose of any adaptive algorithm 
is to get an optimal mesh (heuristically), i.e., to make the discretization errors 
equidistributed between elements. The following adaptive strategy has been 
used in our finite element code: mark for refinement those triangles T for which 
$\eta_T \ge \gamma \max_{T' \in \mathcal{T}_n} \eta_{T'}$, $0 < \gamma < 1$ (see Section 3). In our numerical experiments, 
we choose $\gamma = 0.5$ as a threshold. The marked elements are refined by a bisection 
through the marked edge. The problem (2) is solved by the preconditioned 
conjugate gradient (PCG) method with incomplete Cholesky factorization as a 
preconditioner. Plane stresses are assumed to compute the homogenized elasticity 
coefficients (1). The Young modulus E (in GPa) and the Poisson ratio ν of our 
three materials are, respectively, E = 10, ν = 0.22 for carbon, E = 410, ν = 0.14 
for SiC, and E = 70, ν = 0.17 for SiO2. The numerical examples considered in 
this section do not take into account the additional thin layer of SiO2 formed by 
posttreatment of the SiC-ceramics (after oxidation at high temperature). 

Fig. 2. Homogenized coefficients w.r.t. the widths of carbon and SiC layers: 
a) hole with a weak material; b) hole with no material 

Fig. 3. Problem 1, early wood, density = 51%, 9 adaptive refinement levels: a) carbon, 
1382 triangles, 737 nodes; b) carbon and SiC, 3754 triangles, 1919 nodes 

In Fig. 2 we display the behavior of the homogenized coefficient versus 
the widths of the C and SiC layers, which vary between 0 and 0.5. Picture 
a) in this figure treats the void as a weak material with E = 0.01 and ν = 0.45 
whereas b) concerns a complete void with no material. We compute the effective 
coefficients $E^H_{ijkl}$ only for a fixed number of values of the design parameters (e.g., 
a 20 × 20 grid as shown in Fig. 2) and then interpolate the values by splines. With 
regard to the homogenized state equation (4), this procedure results in having 
explicit formulas at hand for the gradients and the Hessian of the Lagrangian 
function needed in the optimization loop (see [7]). 

The mesh-adaptive process is visualized in Figures 3 and 4. We see that in 
case a) of one material available in the microstructure, an appropriate refinement 
is done around the corners where the hole with a complete void is located. In case 
b) of more materials, additional mesh-adaptivity is needed across the material 
interfaces in the microstructure due to the strongly varying material properties 
(Young's modulus and Poisson's ratio). The ZZ a posteriori error estimator is 
local, not expensive and yields satisfactory results as the numerical experiments 
show. We also note that the density of the microcell depends on the growing state 
of the wood. In early wood regions (growth of the tree in spring and summer) the 
holes are large and the cell walls are thin (see Fig. 3, density 51%). For the late 
wood regions (autumn), the density of the tracheidal cells is larger compared to 
the early wood due to the smaller pores and the thicker cell walls, see Fig. 4. 

Fig. 4. Problem 1, late wood, density = 84%, 9 adaptive refinement levels: a) carbon, 
1527 triangles, 818 nodes; b) carbon and SiC, 3752 triangles, 1916 nodes 

Table 1. Homogenized coefficients (early wood) w.r.t. adaptive refinement level 

level   E^H_1111   E^H_1122   E^H_1212   nt / nn (Prob. 1,2)   nt / nn (Prob. 3) 
1       64.975     7.664      12.116     168 / 100             168 / 100 
2       63.336     6.642      9.750      220 / 126             224 / 128 
3       58.466     6.682      8.073      288 / 162             300 / 168 
4       56.572     7.012      6.643      484 / 262             576 / 308 
5       54.385     6.245      6.212      712 / 378             760 / 402 
6       52.936     6.091      5.474      1208 / 630            1376 / 716 
7       51.914     5.458      5.306      1800 / 932            1800 / 930 
8       50.861     4.790      5.217      2809 / 1444           2688 / 1374 
9       50.455     4.571      5.029      3754 / 1919           3726 / 1896 
10      49.591     4.359      4.983      5918 / 3013           5708 / 2896 

In Table 1 we give some results for the homogenized elasticity coefficients 
on various adaptive refinement levels in the case of early wood. We report the 
number of triangles nt and the number of nodes nn on each level when solving 
problems (2). The corresponding experimental data in the case of late wood are 
presented in Table 2. We see from both tables that the mesh sensitivity on the 
successive levels is very small. Our adaptive mesh-refinement procedure stops 
when an a priori given limit for the number of refinement levels is reached. 

Table 2. Homogenized coefficients (late wood) w.r.t. adaptive refinement level 

level   E^H_1111   E^H_1122   E^H_1212   nt / nn (Prob. 1,2)   nt / nn (Prob. 3) 
1       33.430     3.885      9.893      168 / 100             168 / 100 
2       33.064     3.929      9.577      216 / 126             224 / 128 
3       32.844     4.024      9.283      300 / 168             284 / 160 
4       32.291     4.254      8.970      544 / 296             520 / 280 
5       32.144     4.312      8.809      828 / 438             668 / 356 
6       31.909     4.372      8.703      1354 / 705            1064 / 556 
7       31.862     4.379      8.526      1892 / 980            1484 / 768 
8       31.735     4.399      8.470      2894 / 1485           2232 / 1142 
9       31.711     4.400      8.373      3752 / 1916           3088 / 1576 
10      31.487     4.497      8.321      5716 / 2906           4564 / 2316 

Abstract. In this paper we discuss an efficient solution method for problems 
of elastoplasticity. The phenomenon of plasticity is modeled by an 
additional term in the stress-strain relation; the evolution of this additional 
term in time is described by the Prandtl-Reuß normality law. After 
discretizing the problem in time, we derive a dual formulation. Our solution 
algorithm is based on an equivalent minimization problem, which 
is presented for an isotropic hardening law. Since the objective is non- 
differentiable, we use a differentiable, piecewise quadratic regularization. 
The algorithm is a successive sub-space optimization method: In the first 
step, we solve a Schur-complement system for the displacement variable 
using a multigrid preconditioned conjugate gradient method. The second 
step, namely the minimization in the plastic part of the strain, is split 
into a large number of local optimization problems. Numerical tests show 
the fast convergence of the presented algorithm. 



1 Introduction 

The use of elastic material laws in mechanical models is often not sufficient in 
many real life applications. The phenomenon of plasticity can be described by 
an additional non-linear term in the stress-strain relation. Plasticity models have 
a long history in the engineering community. The interested reader is referred 
to the excellent monograph by Kachanov [7]. The rigorous mathematical and 
numerical analysis of different elastic-plastic models has been a topic of mathe- 
matical research during the last two decades, see e.g. [4,5,8] and the literature 
cited there. 

The admissible stresses are restricted by a yield function depending on the 
hardening of the material. Furthermore, this yield function characterizes the 
plastic behavior: isotropic hardening, kinematic hardening, visco-plasticity and 
perfect hardening. The Prandtl-Reuß normality law describes the time development. 

The starting point of the finite element method is the time-discretized 
variational formulation. This dual formulation in each time step is equivalent to 
an optimization problem depending only on the displacement vector u and the 
plastic part of the strain p: 

\[ f(u,p) = \min_{v,q} f(v,q), \]

under an equality constraint, with f being a convex, non-differentiable function 
with quadratic terms. In what follows, only the case of plasticity with isotropic 
hardening will be considered. A differentiable and piecewise quadratic objective is 
obtained by regularization of f, so that standard methods can be applied. 

The main idea of the algorithm is to use the Schur-complement form of 
the discretized problem in the displacement variable $u_h$. Thus the minimization 
problem reduces to 

\[ u_h = \arg\min_{v_h} f(v_h, q_{\mathrm{opt}}(v_h)), \]

where $q_{\mathrm{opt}}(v_h)$ denotes the optimal plastic part of the strain with respect to 
$v_h$. The system matrix would depend nonlinearly on p, therefore the problem is 
linearized in this variable. To correct the resulting error, the plastic part of the 
strain is determined locally by Newton's method. 

The method presented in this talk is based on the approach proposed by C. 
Carstensen [2]. In contrast to [2], we introduce some regularization of the local 
minimization problems, making the cost functional differentiable and allowing us 
to use the fast converging Newton method. Moreover, we use a multigrid 
preconditioned conjugate gradient (PCG) method, specially adapted to the problem, 
for the Schur-complement problems arising at each incremental step. 

The paper is organized as follows: In Section 2, we give a brief overview on 
the basic equations of elasticity and plasticity. First we derive a dual formula- 
tion and then an equivalent optimization problem for general hardening laws. 
Furthermore, only the case of isotropic hardening will be considered. Section 3 
is devoted to the construction of the algorithm. In Section 4, numerical studies 
show the fast convergence and the efficiency of the algorithm. Finally, an outlook 
on the work still to do is given. 

2 Elastoplasticity 

2.1 Definition of the Problem 

According to the basic theorem of Cauchy, the stress field $\sigma \in L^2(\Omega, \mathbb{R}^{n\times n})$ of 
a deformed body $\Omega$ in $\mathbb{R}^n$ (n = 2, 3) with Lipschitz-continuous boundary has to 
fulfill the equations 

\[ \sigma = \sigma^T \quad \text{in } \Omega, \tag{1} \]
\[ -\operatorname{div} \sigma = b \quad \text{in } \Omega, \tag{2} \]

with b being the vector field of given body forces. The linearized Cauchy-Green 
strain tensor is appropriate in the case of small deformations, and is obtained 
by using the displacement vector u: 

\[ \varepsilon(u) = \tfrac{1}{2}\big(\nabla u + (\nabla u)^T\big). \tag{3} \]

Moreover, in the case of small deformations the strain is split additively into two 
parts: 

\[ \varepsilon(u) = A\sigma + p \quad \text{a.e. in } \Omega. \tag{4} \]

Here, $A\sigma$ denotes the elastic, and p the plastic part. The linear, symmetric, 
positive definite mapping A from $\mathbb{R}^{n\times n}$ to $\mathbb{R}^{n\times n}$ describes the linear elasticity, 
thus $C = A^{-1}$ is the elasticity tensor. 

Purely elastic material behavior is characterized by p = 0. The modeling 
of plasticity requires another material law in order to determine p. There are 
restrictions on the stress variables described by a dissipation functional $\varphi$, which 
is convex and non-negative, but may also attain $+\infty$. The first restriction is 

\[ \varphi(\sigma, \alpha) < \infty \quad \text{a.e. in } \Omega. \tag{5} \]

The hardening parameter $\alpha$ is the memory of the considered body and describes 
previous plastic deformations. Its structure and dimension depend on the 
hardening law. The above inequality indicates that $\alpha$ controls the set of admissible 
stresses. The pair $(\sigma, \alpha)$ is called generalized stresses and values are called 
admissible if $\varphi(\sigma,\alpha) < \infty$. 

The time development of p and $\alpha$ is given by the Prandtl-Reuß normality 
law which states that for all other generalized stresses $(\tau, \beta)$ there holds: 

\[ \dot p : (\tau - \sigma) - \dot\alpha : (\beta - \alpha) \le \varphi(\tau,\beta) - \varphi(\sigma,\alpha) \quad \text{a.e. in } \Omega, \tag{6} \]

where $\dot p$ denotes the time derivative of p, i.e., $\dot p = \partial p/\partial t$, and : is the scalar product 
of matrices such that $A : B = \sum_{i,j=1}^{n} A_{ij} B_{ij}$ for all $A, B \in \mathbb{R}^{n\times n}$. 

Now we are in the position to define the initial value problem: 

Problem 1. We look for the displacement field u G W^’^{0,T; iJg(f2)”), 
the plastic part of the strain p G T; R”^")), the stress field 

cr G and the hardening parameter a G 

L^{f2, R"*)), such that (1) - (6) are satisfied under the initial condition 6(0) = 0. 

The time dependent variational inequality (6) is solved by an implicit time 
discretization, for example generalized midpoint rules like the Crank-Nicolson or 
implicit Euler schemes. An implicit Euler discretization of the weak formulation 
leads to the next problem definition: 

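Problem 2. Let $H \subset H^1_0(\Omega)^n$, $L \subset L^2(\Omega, \mathbb{R}^{n\times n}_{sym})$ and $L^m \subset L^2(\Omega, \mathbb{R}^m)$ be 
closed subspaces. We seek $(u,p,\sigma,\alpha) \in H \times L \times L \times L^m$ such that for 
given $u_0 \in H$; $p_0, \sigma_0 \in L$ and $\alpha_0 \in L^m$ at some time step $t_0$ the following 
conditions are satisfied for $t_1 = t_0 + \Delta t$: 

\[ \int_\Omega \sigma : \varepsilon(v)\, dx = \int_\Omega b\, v\, dx \quad \forall v \in H, \tag{7} \]

\[ \int_\Omega \{ (p - p_0) : (\tau - \sigma) - (\alpha - \alpha_0) : (\beta - \alpha) \}\, dx \le \Delta t \int_\Omega \varphi(\tau,\beta)\, dx - \Delta t \int_\Omega \varphi(\sigma,\alpha)\, dx, \tag{8} \]

for all $(\tau,\beta) \in L \times L^m$. 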
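Here $\mathbb{R}^{n\times n}_{sym}$ denotes the real, symmetric n × n matrices. The inequality (8) reads as 

\[ \frac{1}{\Delta t}(p - p_0, -\alpha + \alpha_0) \in \partial\varphi(\sigma,\alpha). \tag{9} \]

The sub-differential $\partial\varphi(b)$ is defined by the relation $a \in \partial\varphi(b) \Leftrightarrow \int_\Omega a : (c - b)\, dx \le \int_\Omega \varphi(c)\, dx - \int_\Omega \varphi(b)\, dx$ 
for all c. Since $\varphi$ is convex, the above relation is 
equivalent to: 

\[ (\sigma,\alpha) \in \partial\varphi^*\Big(\frac{1}{\Delta t}(p - p_0, -\alpha + \alpha_0)\Big). \tag{10} \]

$\varphi^*$ is the dual functional of $\varphi$, which is computed by the Fenchel transformation: 
$\varphi^*(y) := \sup_x \{ y : x - \varphi(x) \}$. 

Substituting $\sigma = C(\varepsilon(u) - p)$ in (7) and (10), the simplified Problem 3 is 
obtained. 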



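Problem 3. Find a triple $(u,p,\alpha) \in H \times L \times L^m$ such that the following 
conditions are satisfied for all $(v,q,\beta) \in H \times L \times L^m$: 

\[ \int_\Omega C[\varepsilon(u) - p] : \varepsilon(v)\, dx = \int_\Omega b\, v\, dx, \tag{11} \]

\[ \int_\Omega \{ C[\varepsilon(u) - p] : (\Delta t\, q - p + p_0) + \alpha : (-\Delta t\, \beta + \alpha_0 - \alpha) \}\, dx \le \Delta t \int_\Omega \varphi^*(q,\beta)\, dx - \Delta t \int_\Omega \varphi^*\Big(\frac{p - p_0}{\Delta t}, \frac{\alpha_0 - \alpha}{\Delta t}\Big)\, dx. \tag{12} \]

Problem 3 is the stationary condition of a minimizer in the following minimization 
Problem 4. Vice versa, a minimizer of f in Problem 4 is a solution of 
Problem 3: 

Problem 4. Find the minimizer $(u,p,\alpha) \in H \times L \times L^m$ of 

\[ f(u,p,\alpha) := \frac{1}{2} \int_\Omega C[\varepsilon(u) - p] : (\varepsilon(u) - p)\, dx + \frac{1}{2} \int_\Omega |\alpha|^2\, dx + \Delta t \int_\Omega \varphi^*\Big(\frac{p - p_0}{\Delta t}, \frac{\alpha_0 - \alpha}{\Delta t}\Big)\, dx - \int_\Omega b\, u\, dx. \tag{13} \]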

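2.2 Isotropic Hardening 

In the case of isotropic hardening the space dimension m of the hardening 
parameter $\alpha$ is 1, i.e., $\alpha$ is a scalar function. The dual functional of the dissipation 
functional $\varphi$ can be computed as follows (see [2]): 

\[ \varphi^*(A,B) = \begin{cases} \sigma_y |A| & \text{if } \operatorname{tr} A = 0 \,\wedge\, (\sigma_y H |A| + B) \le 0, \\ \infty & \text{if } \operatorname{tr} A \ne 0 \,\vee\, (\sigma_y H |A| + B) > 0, \end{cases} \]

with the two arguments $A = (p - p_0)/\Delta t$ and $B = (\alpha_0 - \alpha)/\Delta t$. The minimization of (13) with 
respect to $\alpha$ affects only the term $\frac{1}{2}\int_\Omega |\alpha|^2 dx$ under the restriction $(\sigma_y H |A| + B) \le 0$. 
The unique solution is $\alpha = \alpha_0 + \sigma_y H |p - p_0|$. So we obtain a simplified 
minimization problem: 

Problem 5. Find the minimizer (u,p) of 

\[ f(u,p) := \frac{1}{2} \int_\Omega C[\varepsilon(u) - p] : (\varepsilon(u) - p)\, dx + \frac{1}{2} \int_\Omega (\alpha_0 + \sigma_y H |p - p_0|)^2\, dx + \int_\Omega \sigma_y |p - p_0|\, dx - \int_\Omega b\, u\, dx \tag{14} \]

under the constraint $\operatorname{tr}(p - p_0) = 0$. 

The constraint follows from tr A = 0. The uniqueness of the minimizer follows 
from the properties of the dual functional, see [3]. 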

3 Algorithm 

Time and space discretizations are needed to describe the mathematical model 
numerically. For the numerical tests only equidistant time intervals will be used. 
The notation is the same as for the time discretized problems of Section 2: For 
given variables (with index 0) of an initial time step $t_0$, the updates of the 
variables at the time step $t_1 = t_0 + \Delta t$ have to be determined. The basic idea 
for solving the quasi-static problem is to use a uniform time discretization and 
to iterate in each time step until the minimizers, i.e., the displacement u and the 
plastic part of the strain p, are determined. Then these values and the separately 
calculated $\alpha$ are used as the reference values with index 0 for the next time step 
$t_2$. 

If a function is quadratic, then the minimum can be computed easily, e.g. by 
Newton's method. Unfortunately, (14) is not. The matrix C is symmetric and 
positive definite, thus $C[\varepsilon(u) - p] : (\varepsilon(u) - p)$ behaves quadratically in $(\varepsilon(u) - p)$. 
The second term is quadratic in p, since $p_0$ (the result of the previous time step) 
and $\alpha_0$ are considered as constants. The last term behaves linearly in u, so it 
adds to the right hand side of the corresponding system of equations, see (19). 
The only term that does not behave quadratically, and is therefore problematic, is 
the third one: it contains an absolute value, whose sharp bend may cause trouble. 

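The term is regularized by smoothing the absolute value function, i.e., by
replacing $|\cdot|$ with a differentiable approximation $|\cdot|_\varepsilon$
that depends on a small parameter $\varepsilon > 0$ (15).

For small $\varepsilon$, the regularized function is very similar to the original
$f(u,p)$, but its properties change enormously. Therefore, it will be referred to
by the new symbol $\bar f$.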
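To illustrate what such a smoothing does, the following minimal sketch uses the function $\sqrt{x^2 + \varepsilon^2}$. This is only an assumed stand-in: the smoothing actually used in (15) may have a different form, and the names `smooth_abs` and `smooth_abs_derivative` are hypothetical.

```python
import math


def smooth_abs(x: float, eps: float) -> float:
    """Smooth stand-in for |x| (assumed form): sqrt(x^2 + eps^2)."""
    return math.sqrt(x * x + eps * eps)


def smooth_abs_derivative(x: float, eps: float) -> float:
    """Derivative of smooth_abs; unlike sign(x) it is continuous at x = 0."""
    return x / math.sqrt(x * x + eps * eps)
```

Away from zero the stand-in is close to $|x|$ for small $\varepsilon$, while at zero its derivative is well defined, which is what a Newton-type method needs.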

Another simplification is to define the change of $p$ by $\tilde p = p - p_0$ and
to use it as an argument of the objective instead of $p$:

$$\bar f(u,\tilde p) := \frac12\int_\Omega C[\varepsilon(u)-\tilde p-p_0] : (\varepsilon(u)-\tilde p-p_0)\,dx + \frac12\int_\Omega \left(\alpha_0+\sigma_y H|\tilde p|_\varepsilon\right)^2 dx + \int_\Omega \sigma_y|\tilde p|_\varepsilon\,dx - \int_\Omega b\,u\,dx \qquad (16)$$

Now the spatial discretization is carried out by the standard finite element
method (see e.g. [1]) using quadratic tetrahedral finite elements. For better
readability and coherence, the vector denoting the discretized displacement $u$
is again called $u$. The same holds for $\tilde p$ and $p_0$; in addition, the
matrices are transformed to vectors, e.g. in 2D

$$\begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix} \;\longmapsto\; (p_{11},\, p_{22},\, p_{12})^T,$$

such that the objective and the other equations can be written in matrix and
vector notation. Now the objective is equivalent to

$$\frac12 (Bu-\tilde p)^T C\,(Bu-\tilde p) + \frac12\,\tilde p^T \mathbf{H}(|\tilde p|_\varepsilon)\,\tilde p + (-B^T C p_0 - b)^T u \qquad (17)$$

under the constraint $\operatorname{tr}\tilde p = 0$. Here, $Bu$ denotes the
discretized strain $\varepsilon(u)$. $\mathbf{H}$ is the Hessian of the discretized
objective with respect to $\tilde p$ and it depends on $|\tilde p|_\varepsilon$.
In order to obtain a linear system of equations, the Hessian is computed in every
iteration step using the current $\tilde p$, but apart from this the dependence on
$|\tilde p|_\varepsilon$ is neglected. This is not an exact method to determine
the change of the strain, but the error will be corrected later on, as $\tilde p$
will be computed separately.

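Since the constraint $\operatorname{tr}\tilde p = 0$ is linear, i.e., in 2D
$p_{22} = -p_{11}$ and in 3D $p_{33} = -p_{11} - p_{22}$, it is equivalent to
project the problem with the matrix $P$ onto a hyperplane where the constraint is
satisfied exactly: $\tilde p = P\hat p$ with

$$\text{2D: } P = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \text{3D: } P = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ -1 & -1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$

In matrix notation the minimization problem (17) with the new variable $\hat p$
now reads as

$$\frac12 \begin{pmatrix} u \\ \hat p \end{pmatrix}^T \begin{pmatrix} B^T C B & -B^T C P \\ -P^T C B & P^T (C+\mathbf{H}) P \end{pmatrix} \begin{pmatrix} u \\ \hat p \end{pmatrix} + \text{(linear terms)} \qquad (18)$$

The above matrix is positive definite, thus the minimizer $(u, P\hat p)$ has to
fulfill the necessary condition of the derivative being equal to zero:

$$\begin{pmatrix} B^T C B & -B^T C P \\ -P^T C B & P^T (C+\mathbf{H}) P \end{pmatrix} \begin{pmatrix} u \\ \hat p \end{pmatrix} = \begin{pmatrix} -b - B^T C p_0 \\ -P^T C p_0 \end{pmatrix} \qquad (19)$$

Extracting $\hat p$ from the second line and inserting it into the first one
yields the Schur complement system in $u$:

$$B^T\left(C - CP\,(P^T(C+\mathbf{H})P)^{-1}P^T C\right)B\,u = -b - B^T\left(C + CP\,(P^T(C+\mathbf{H})P)^{-1}P^T C\right)p_0 \qquad (20)$$

This linear system is solved by a multigrid preconditioned conjugate gradient
method. Multigrid preconditioners were studied in [6].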
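The elimination step behind (20) can be checked on a tiny example. The sketch below is not the authors' implementation: it uses plain Gaussian elimination instead of a multigrid preconditioned conjugate gradient method, generic right-hand side blocks `r1` and `r2`, and made-up toy matrices `B`, `C`, `H` (here standing for the Hessian block); all helper names are hypothetical.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def schur_solve(A11, A12, A21, A22, r1, r2):
    """Solve [[A11, A12], [A21, A22]] (u, p) = (r1, r2) by eliminating p."""
    # X = A22^{-1} A21 is built column by column; y = A22^{-1} r2.
    X = transpose([solve(A22, list(col)) for col in zip(*A21)])
    y = solve(A22, r2)
    A12X = matmul(A12, X)
    n = len(A11)
    S = [[A11[i][j] - A12X[i][j] for j in range(n)] for i in range(n)]
    A12y = matvec(A12, y)
    u = solve(S, [r1[i] - A12y[i] for i in range(n)])
    A21u = matvec(A21, u)
    p = solve(A22, [r2[i] - A21u[i] for i in range(len(r2))])
    return u, p


# Toy 2D data (all values hypothetical): 2 displacement dofs, 3 strain components.
B = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
C = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
H = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
P = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]   # 2D projection onto tr p = 0
CH = [[C[i][j] + H[i][j] for j in range(3)] for i in range(3)]
Bt, Pt = transpose(B), transpose(P)

A11 = matmul(Bt, matmul(C, B))                                   # B^T C B
A12 = [[-v for v in row] for row in matmul(Bt, matmul(C, P))]    # -B^T C P
A21 = transpose(A12)                                             # -P^T C B
A22 = matmul(Pt, matmul(CH, P))                                  # P^T (C + H) P
r1, r2 = [1.0, 2.0], [0.3, -0.1]

u, p_hat = schur_solve(A11, A12, A21, A22, r1, r2)
p = matvec(P, p_hat)   # lifted back to three components, trace-free by construction
```

Solving the Schur complement system first and recovering the reduced plastic variable afterwards gives the same result as solving the full block system, while the matrix to be inverted in the elimination is small and local.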

The minimization in $\tilde p$ can be done locally for each element, as no
couplings across several elements (e.g. through derivatives) occur. For
minimizing the function $\bar f$ in $\tilde p$, all the terms depending only on
$u$ become redundant. The remaining function, called $F$, becomes

$$F(\tilde p) = \frac12\,\tilde p^T C\,\tilde p + \tilde p^T C\,p_0 - \tilde p^T C\,\varepsilon(u) + \frac12\,\sigma_y^2 H^2 |\tilde p|_\varepsilon^2 + \sigma_y (1 + \alpha_0 H)\,|\tilde p|_\varepsilon. \qquad (21)$$

Then $\tilde p$ is determined by a modified Newton's method, as again the
constraint has to be considered. The local Newton system to determine the search
direction $\Delta\hat p$ reads

$$P^T F''(\tilde p)\,P\,\Delta\hat p = -P^T F'(\tilde p). \qquad (22)$$
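A projected Newton iteration of the form (22) can be sketched for a single element in 2D. The example below is illustrative only: it minimizes a made-up smooth convex function (a diagonal quadratic part plus $\sqrt{|p|^2 + \varepsilon^2}$ as an assumed smooth stand-in for the regularized absolute value), it uses plain Newton steps without the safeguards a modified method may include, and all constants and names are hypothetical.

```python
import math

# Assumed 2D ordering p = (p11, p22, p12); P maps p_hat = (p11, p12) to a
# trace-free p, so the constraint holds for every iterate.
P = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]

C_DIAG = [2.0, 2.0, 1.0]   # hypothetical diagonal stand-in for the matrix C
G = [3.0, -1.0, 2.0]       # hypothetical linear term
COEFF, EPS = 1.0, 0.1      # hypothetical weight and smoothing parameter


def lift(p_hat):
    return [sum(P[i][j] * p_hat[j] for j in range(2)) for i in range(3)]


def grad(p):
    s = math.sqrt(sum(x * x for x in p) + EPS * EPS)
    return [c * x - g + COEFF * x / s for c, x, g in zip(C_DIAG, p, G)]


def hess(p):
    s = math.sqrt(sum(x * x for x in p) + EPS * EPS)
    return [[(C_DIAG[i] + COEFF / s if i == j else 0.0)
             - COEFF * p[i] * p[j] / s ** 3
             for j in range(3)] for i in range(3)]


def reduced(p):
    """Return P^T F''(p) P (2x2) and P^T F'(p) (length 2)."""
    Hm, g = hess(p), grad(p)
    HP = [[sum(Hm[i][k] * P[k][j] for k in range(3)) for j in range(2)]
          for i in range(3)]
    Hr = [[sum(P[k][i] * HP[k][j] for k in range(3)) for j in range(2)]
          for i in range(2)]
    gr = [sum(P[k][i] * g[k] for k in range(3)) for i in range(2)]
    return Hr, gr


def projected_newton(p_hat, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        Hr, gr = reduced(lift(p_hat))
        if math.hypot(gr[0], gr[1]) < tol:
            break
        # Solve the 2x2 system Hr d = -gr by Cramer's rule.
        det = Hr[0][0] * Hr[1][1] - Hr[0][1] * Hr[1][0]
        d0 = (-gr[0] * Hr[1][1] + gr[1] * Hr[0][1]) / det
        d1 = (-gr[1] * Hr[0][0] + gr[0] * Hr[1][0]) / det
        p_hat = [p_hat[0] + d0, p_hat[1] + d1]
    return p_hat


p_hat = projected_newton([0.0, 0.0])
p = lift(p_hat)   # trace-free by construction: p[0] + p[1] == 0
```

Because the iteration works on the reduced variable only, the trace constraint never has to be enforced separately.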



4 Numerical Results 

The algorithm was implemented in NGSolve, the finite element solver extension
pack of the mesh generation tool Netgen¹ developed in our group. As finite
element basis functions on the triangular, resp. tetrahedral, elements we chose
piecewise quadratic functions for $u$ and piecewise constant functions for $p$.
Furthermore, the full multigrid method was used, i.e., we started with a coarse
grid, solved the problem, refined the grid, solved the problem on the finer
grid, and so on.

The test geometry is one half of a three-dimensional ring, see the left part of
Figure 1 for a 2D sketch. The problem is symmetric, so it is sufficient to
consider only the upper quarter, see the middle part of Figure 1, under symmetry
boundary conditions. The material constants are chosen as $E = 1$, $\nu = 0.2$,
$H = 0.01$, $\sigma_y = 1$, and the force acting on the right edge is $F = 0.25$
(to fulfill the safe-load assumption). The inner radius of the quarter ring is 1,
the outer radius is 2, and the thickness is 1.

Fig. 1. Half ring, reduced problem and plasticity domain (dark grey)

The right part of Figure 1 shows the domains where the material has plastified.
Figure 2 shows the linear complexity of the algorithm.

Fig. 2. Linear complexity of algorithm (Dofs versus time)

¹ Download at http://www.sfb013.uni-linz.ac.at/~joachim/netgen



5 Conclusions 

In this paper the theory of elastoplasticity has been combined with the nested 
iteration approach and a multigrid preconditioned conjugate gradient solver. A 
quasi-static algorithm solving the case of isotropic hardening in 3D has been 
created, implemented and tested. The numerical results demonstrate the fast 
algorithm performance with linear complexity. 

Our future work will concentrate on implementing the algorithm for other 
hardening laws, and doing numerical analysis to prove the convergence observed 
in the numerical example.

Abstract. Water droplets on insulating material strongly influence the
aging process of the material, and the material loses its hydrophobic
and insulating properties. The shape of the droplets indicates the state of
the aging material. The present paper discusses a numerical procedure
to calculate the droplet shape in an electric field generated by a constant
voltage between two electrodes. This leads to a transmission problem on
the free surface of the droplet. It contains two sub-problems: first, finding
the droplet shape in a given electric field via an evolution problem, and
second, calculating the electric field for a given droplet shape via finite
elements. The force density acting on the droplet shape depends on the
electric field, and the electric field depends on the droplet shape. Both
sub-problems have to be solved simultaneously. The typical shapes of the
droplets are shown for several voltages. Finally, a comparison between
the behaviours of a dielectric droplet of pure rainwater and a conductive
droplet of water with environmental additives is presented.



1 Introduction 

Water droplets on insulating material strongly influence the aging process of the
material, and the material loses its hydrophobic and insulating properties. This
process causes considerable problems in the maintenance of high voltage power
lines. The shape of the droplets indicates the state of the aging material, [6]. A
first step towards understanding the effects of the aging process on the droplet
shapes is the determination of the droplet behaviour within electric fields.

The present paper considers a single droplet in an electric field as it has been
used in the experiments in [6]. The water droplet lies on a solid support made of
resin, and the electric field is generated by a voltage between two electrodes
inside the resin. If a voltage is applied, the droplet becomes lengthened and
flattened. In particular, it loses its axial symmetry. But the electric field in
turn depends on the droplet shape and on the conductive properties of the water.
This feedback leads to a free boundary value problem, cf. [13].

In [10] and [12] the problem was numerically handled by finite integration 
techniques, comp. [11], on a three-dimensional equidistant grid in a harmonic 
quasi-stationary formulation. Here we concentrate in particular on the variable 
shapes of the droplets. Thus, finite elements on a triangulation adapted to the 
droplet shape are used for the calculation of the electric potential. 



I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 387-395, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004

The realistic situation of a three-dimensional droplet in an alternating electric
field involves various effects like material flux inside the droplet fluid, inertial
effects, induced currents in the fluid and so on. All these effects are regarded
to be rather small, see [10]. Hence, only stationary electric fields are considered
here.

Further, the restriction to a two-dimensional droplet does not avoid any
specific difficulty. The handling of the free boundary value problem by a
Banach-like iteration proposed in [2] can be transferred to the three-dimensional
droplet as well, with some technical effort but without fundamental changes of
the numerical procedure. Here, we focus on the numerics of the free boundary
value problem of a droplet in a stationary electric field, which consists of an
iteration alternating between two sub-problems: an evolution problem for
calculating the droplet shape, see Sec. 3, and a finite element computation for
determining the electric field, see Sec. 4.

The main ideas of the iteration are presented in Sec. 2 together with a detailed 
model set-up. In Sec. 3, the equilibrium of forces on the upper boundary of the 
droplet and in the triple points is discussed. The droplet shape is computed via 
an evolution problem modeling a hypothetical transient motion of the droplet. 
Sec. 4 presents the details in the computation of the electric potential and of the 
force density on the droplet boundary caused by the electric field. 

In Sec. 5, we compare the dielectric and the conductive droplet. On the one
hand, pure rainwater is nearly non-conductive. On the other hand, environmental
additives pollute the water, and it becomes conductive. The droplet shapes are
given for both situations and various voltages. We briefly present further ideas
to handle a droplet of rainwater with a very small, limited amount of additives,
i.e. the case that the fluid is an imperfect conductor.

2 Problem Set-Up 

2.1 Model Description and Notations 

We investigate a two-dimensional water droplet in a stationary electric field. The
droplet is described by its upper boundary $\Gamma$, which is given in polar
co-ordinates with the origin $O$, i.e.
$\Gamma = \{x_\Gamma = (r(\varphi)\cos\varphi,\ r(\varphi)\sin\varphi) \;;\; \varphi \in [0,\pi]\}$.
The angle $\varphi$ serves as parameter of the points $x_\Gamma(\varphi)$ on the
boundary. The droplet fluid is in fact rainwater with the mass density
$\delta = 1000\ \mathrm{kg\,m^{-3}}$ and the surface tension
$\alpha = 0.072\ \mathrm{N\,m^{-1}}$. The surface tension does not depend on the
electric field here. The voltage generating the electric field is $2U$. The
boundary of the electrodes is called $\Gamma_1$, see Fig. 1. The points $A$ and
$B$ are used in Sec. 3.1. The positions of the electrodes inside the resin
obviously influence the resulting numerical values.

The dielectricity of rainwater is $\varepsilon_{\mathrm{H_2O}} = 81$ and the
dielectricity of the solid support made of resin is $\varepsilon_2 = 4$. The
dielectricity of the air is $\varepsilon_1 = 1.0006 \approx 1$, which is an
acceptable approximation for our purposes.

The electric field causes a concentration $\rho$ of electric charge at the
boundary $\Gamma$ of the droplet. In the case of a dielectric droplet, this charge
is caused by polarization, and in the case of a conductive droplet, there is free
electric charge at the boundary. This distinction is discussed in Sec. 4.1. The
concentration of electric charge at $\Gamma$ results in a force density
$p_e(\varphi) = p_e(x_\Gamma(\varphi))$ there.

Fig. 1. Left: Model set-up of the problem. The droplet lies in the centre on the
surface of the solid support containing two electrodes. Right: Evolution of the
radius in the evolution problem starting with $r(\varphi) = \text{const.}$ and
setting $p_e = 0$. The solid lines mark the initial and final shape of the
droplet. The dashed lines give the radius at the auxiliary time-instants
t = 2 · 10 “® s, t = 6 · 10 “® s and t = 10 · 10 “® s (from above). The height is
four times exaggerated, cf. Fig. 2 for a droplet with low distortion in scale.



2.2 Fundamental Iteration 

The free boundary value problem consists of two sub-problems. The first
sub-problem of finding the droplet shape described by $r(\varphi)$ for a given
force density $p_e(\varphi)$ is called the $\mathcal{R}$-problem. The second
sub-problem of finding $p_e$ for a given $r$ is referred to as the
$\mathcal{P}$-problem. There are two operators $\mathcal{R}$ and $\mathcal{P}$
mapping a force density $p_e$ to a resulting $r$ and, vice versa, a given $r$ to a
force density $p_e$, i.e.

$$\mathcal{R} : p_e \mapsto r \quad\text{and}\quad \mathcal{P} : r \mapsto p_e .$$

While solving the problem of a stationary droplet in the electric field, we are
searching for a fixed point of the combined operator $\mathcal{R}\mathcal{P}$,
i.e. $r_{ex} = \mathcal{R}\mathcal{P}\,r_{ex}$.

To separate both sub-problems we state the Banach-like iteration

$$r^{(i+1)} = \omega\,\mathcal{R}\mathcal{P}\,r^{(i)} + (1-\omega)\,r^{(i)} \qquad (1)$$

with the relaxation parameter $\omega$. The iteration (1) is a standard procedure
in the solution of free boundary value problems, cf. [13]. It was successfully
applied to the closed osmometer problem in [2]. The basic ideas of the convergence
proof from [2] can be transferred to the present problem. So, we assume

$$\lim_{i\to\infty} r^{(i)} = r_{ex} \quad\text{(point-wise)} \qquad (2)$$

at least for suitable $\omega$, and the tests presented in Sec. 5 show that (2)
holds.

3 The $\mathcal{R}$-Problem of Calculating the Droplet Shape



3.1 The Equilibrium of Forces at the Droplet Boundary 

The forces acting in $x_\Gamma(\varphi)$ with $\varphi \in (0,\pi)$ at the upper
boundary of the droplet in the direction of the outer normal $n$ on $\Gamma$ are
the capillary pressure $p_k$, the pressure $p_h$ depending on the depth of water,
a pressure $p_0$ serving as penalty parameter to hold the droplet volume
constant, and the outer force density $p_e$ caused by the electric field. The
droplet is stationary if the equilibrium of forces

$$0 = p_k(\varphi) + p_h(\varphi) + p_0(\varphi) + p_e(\varphi) \qquad (3)$$

holds in every point $x_\Gamma(\varphi)$ with $\varphi \in (0,\pi)$, i.e.
everywhere except in the corners $P = x_\Gamma(0)$ and $P' = x_\Gamma(\pi)$. The
capillary pressure is given by

$$p_k(\varphi) = -\alpha\cdot\kappa(\varphi) \quad\text{with}\quad \kappa(\varphi) = \frac{r(\varphi)^2 + 2r'(\varphi)^2 - r(\varphi)\,r''(\varphi)}{\left(r(\varphi)^2 + r'(\varphi)^2\right)^{3/2}}$$

depending on the curvature $\kappa$ of the droplet surface, cf. [4]. The pressure
inside the droplet depending on the depth of the water is

$$p_h(\varphi) = \delta g\,(h - r(\varphi)\sin\varphi) \quad\text{with}\quad h = \max_{\varphi\in[0,\pi]} r(\varphi)\sin\varphi .$$

The penalty pressure $p_0$ acts against a volume dilatation and is expressed by

Po = Impair with Vr = ^J r{pY dp (4) 

with an assumed volume $V$ of the undeformed droplet fluid under the air pressure
$p_{\mathrm{air}}$. Equation (4) replaces the constraint condition of
incompressibility in the present formalism of equilibrated forces.
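The geometric quantities entering the force balance are easy to evaluate once $r$, $r'$ and $r''$ are known at a grid point. The following sketch uses the standard curvature formula for a polar curve and a trapezoidal rule for the integral $\frac12\int_0^\pi r(\varphi)^2\,d\varphi$; the function names are hypothetical and the penalty pressure itself is not modelled here.

```python
import math

ALPHA = 0.072   # surface tension of water in N/m, as given in Sec. 2.1


def curvature(r, dr, ddr):
    """Curvature of a polar curve r(phi) from r, r' and r''."""
    return (r * r + 2.0 * dr * dr - r * ddr) / (r * r + dr * dr) ** 1.5


def capillary_pressure(r, dr, ddr, alpha=ALPHA):
    """p_k = -alpha * kappa, measured along the outer normal."""
    return -alpha * curvature(r, dr, ddr)


def droplet_area(r_values, dphi):
    """Trapezoidal approximation of 0.5 * integral of r(phi)^2 over the grid."""
    sq = [0.5 * r * r for r in r_values]
    return dphi * (sum(sq) - 0.5 * (sq[0] + sq[-1]))
```

For a half circle of radius $R$ the curvature is $1/R$ and the area is $\pi R^2/2$, which gives a quick sanity check of both routines.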

In the corner $P$ the surface tension $\alpha$ acts between the fluid and the air
in the direction $PB$. The boundary tension is the difference between, first, the
surface tension between the solid support and the fluid in the direction $PA$
and, second, the small boundary tension between the solid and the air. This leads
to a constant angle between the droplet shape and the support, cf. [4]. We use
the angle $\vartheta = 1.1$ and state in the corners $P$ and $P'$ the equations

$$\tan\vartheta\cdot r'(0) + r(0) = 0 \quad\text{and}\quad \tan\vartheta\cdot r'(\pi) - r(\pi) = 0 . \qquad (5)$$

Investigations of the behaviour of the electric potential near the corner points
in [8], cf. [9], show that additional forces disturbing the equilibrium of forces
in the corners do not occur within the present problem.



3.2 Numerics of the Evolution Problem 

Equation (3) with the boundary conditions (5) is an ordinary non-linear elliptic
boundary value problem for $r(\varphi)$ on $\varphi \in [0,\pi]$, cf. [3]. It can
be solved by computing the respective parabolic problem with an auxiliary time
$t$. That is

$$\frac{\partial}{\partial t} r(\varphi,t) = p_k(\varphi) + p_h(\varphi) + p_0(\varphi) + p_e(\varphi), \qquad (6)$$

$$\frac{\partial}{\partial t} r(0,t) = k\,[\tan\vartheta\cdot r'(0,t) + r(0,t)] \quad\text{and}\quad \frac{\partial}{\partial t} r(\pi,t) = -k\,[\tan\vartheta\cdot r'(\pi,t) - r(\pi,t)], \qquad (7)$$

where $r'(\cdot,t)$ denotes the derivative with respect to the first, local
parameter $\varphi$, and $k$ is a suitable amplification factor. The factor $k$
balances the evolution speed of the droplet boundary inside the interval
$\varphi \in (0,\pi)$ and at its ends, and makes the solution of (6) numerically
more convenient.

Equations (6)-(7) can be discretized over $\varphi$, see [1], and solved by
standard methods for stiff ordinary differential equations, [5], e.g. ode15s in
Matlab.

In the examples of Sec. 5 the interval [0, tt] was discretized into 200 equidistant 

grid points. Here a time-constant radius is reached in less than 1-10“^ s, cf. Fig. 1. 
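The idea of reaching the solution of an elliptic boundary value problem as the steady state of a parabolic problem in an auxiliary time can be illustrated on a much simpler model. The sketch below is not the droplet problem: it treats $u_t = u_{xx} + 1$ on $(0,1)$ with zero boundary values, and it swaps the stiff solver mentioned above for explicit Euler steps, which is only feasible because the grid is coarse. All names and parameters are hypothetical.

```python
def relax_to_steady_state(n=20, tol=1e-10, max_steps=200000):
    """Pseudo-time stepping for u_t = u_xx + 1 on (0, 1) with u(0) = u(1) = 0."""
    h = 1.0 / n
    dt = 0.4 * h * h              # explicit Euler is stable only for dt <= h^2 / 2
    u = [0.0] * (n + 1)           # initial guess; boundary values stay zero
    for step in range(max_steps):
        res = [0.0] * (n + 1)
        for i in range(1, n):
            # residual of the stationary equation u_xx + 1 = 0 at grid point i
            res[i] = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h) + 1.0
        if max(abs(v) for v in res) < tol:
            return u, step        # time-constant state reached
        for i in range(1, n):
            u[i] += dt * res[i]
    raise RuntimeError("no steady state reached")


u_steady, steps = relax_to_steady_state()
```

The step size restriction of the explicit scheme becomes severe on fine grids, which is one reason why implicit methods for stiff systems are the usual choice for such evolution problems.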



4 The P-Problem of Calculating the Electric Potential 



4.1 The Electric Potential and the Electric Force Density 

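In the first case of a dielectric droplet, the support $\Omega^{(1)}$ of the
electric field is the whole plane outside the electrodes. In the second case of a
conductive droplet, the droplet is free of an electric field, and $\Omega^{(2)}$
is the plane outside the electrodes and outside the droplet. Furthermore, the
domains $\Omega^{(i)}$ are each free of electric charge and the potentials
$\Phi^{(i)}$ vanish at the boundaries $\partial\Omega^{(1)}\setminus\Gamma_1$ and
$\partial\Omega^{(2)}\setminus\Gamma_1$. The fact that the potential vanishes on
the boundary of a conductive droplet follows from the symmetry of the problem. It
holds $E^{(i)}(x) = -\nabla\Phi^{(i)}(x)$.

We get two boundary value problems, cf. [7], for $i = 1, 2$:

$$\begin{aligned} -\nabla\cdot[\varepsilon(x)\nabla\Phi^{(i)}(x)] &= 0 && \text{in } x \in \Omega^{(i)}, \\ \Phi^{(i)}(x) &= 0 && \text{on } x \in \partial\Omega^{(i)}\setminus\Gamma_1, \qquad (8) \\ \Phi^{(i)}(x) &= \operatorname{sign}(x_1)\cdot U && \text{on } x \in \Gamma_1 . \end{aligned}$$

The position-dependent dielectricity $\varepsilon(x)$ with $x = (x_1, x_2)^T$ is
$\varepsilon_1$ in the air, $\varepsilon_2$ in the support and
$\varepsilon_{\mathrm{H_2O}}$ inside the droplet, see Sec. 2. Problem (8) is
linear in $U$.

In the case of a conductive droplet, we temporarily assume free electric charge
$\rho(x) = \varepsilon_0 \nabla\cdot E^{(2)}(x)$ in a limit layer on the droplet
boundary, and the force density acting on the droplet boundary is found by
integration in normal direction

$$p_e(x_\Gamma) = \lim \int \rho(x_\Gamma + \sigma n)\,E^{(2)}(x_\Gamma + \sigma n)\,d\sigma = \frac12\,\varepsilon_0 \left(\frac{\partial\Phi^{(2)}(x_\Gamma)}{\partial n}\right)^2 \qquad (9)$$

corresponding to [4] and [12].

In the case of a dielectric droplet, the normal component of the electric field
$E^{(1)}$ jumps at the boundary of the droplet and causes a concentration of
polarized charge $\rho_p(x) = \varepsilon_0 \nabla\cdot E^{(1)}(x)$. The force
density acting on the droplet boundary is found similarly to (9) by

$$p_e(x_\Gamma) = \frac12\,\varepsilon_0 \left[\left(\frac{\partial\Phi^{(1)}(x_\Gamma)}{\partial n^+}\right)^2 - \left(\frac{\partial\Phi^{(1)}(x_\Gamma)}{\partial n^-}\right)^2\right] \qquad (10)$$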



392 Dirk Langemann 



with the normal derivatives $\partial/\partial n^+$ outside the droplet and
$\partial/\partial n^-$ inside the droplet. Hence, the force density $p_e$ can be
calculated in both cases depending on the droplet shape, which determines the
domain $\Omega^{(2)}$ resp. the dielectricity $\varepsilon(x)$ in $\Omega^{(1)}$.
Since the potential vanishes inside the conductive droplet, (9) and (10) express
identical facts although the physical background is quite different.



4.2 Finite Elements and Numerical Treatment 

Using the symmetry, we consider (8) only in the right half-plane. The half-plane
is truncated to a rectangle. The error due to the truncation is regarded as
small because of the local character of any disturbance of the electric field.

The boundary value problem (8) with its Dirichlet boundary conditions is
solved by standard finite elements on a triangulation which is refined near the
boundary of the droplet, cf. Fig. 2. It is created once in a pre-processing step
and adapted to each current shape of the droplet.

Fig. 2. Details of the triangulations, left: dielectric droplet, right: conductive
droplet. Both contain a droplet in the absence of an electric field.

In the presented examples triangulations with 9011 points and about 17600
triangles were used. That leads to $\approx 50$ triangles sharing a side with the
discretized boundary of the droplet. The resulting system of linear equations is
solved by standard procedures for sparse matrices. The solution was checked
against solutions on different triangulations and regarded as sufficiently
accurate.



5 Comparison of Dielectric and Conductive Droplets 

Fig. 3. Potentials (details) near the droplet for $U = 8$ kV, left: dielectric,
right: conductive droplet. Refraction is noticeable at the interface of the
support and the air.

The iteration (1) to find a fixed point of the operator $\mathcal{R}\mathcal{P}$
converges rather fast. For low voltages $U < 10$ kV, it needs about five steps
with $\omega = 1$. Higher voltages require a smaller relaxation parameter and
needed up to 20 steps. One step on the triangulation given in Sec. 4.2 takes
about 90 s on a 2 GHz machine. It comprises the solution of the Dirichlet problem
(8) on the pre-processed triangulation for the shape given by $r^{(i)}$,
calculating $p_e$, solving the evolution problem (6) to find $r^{(i+1)}$ and
adapting the triangulation to the new droplet shape.
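The structure of the relaxed iteration (1) can be sketched independently of the two expensive sub-problems. In the following minimal example the operators are replaced by made-up scalar functions, so it only shows the control flow and the role of the relaxation parameter; `R_op`, `P_op` and `relaxed_fixed_point` are hypothetical names.

```python
def relaxed_fixed_point(T, r0, omega=1.0, tol=1e-12, max_iter=200):
    """Iterate r <- omega * T(r) + (1 - omega) * r until the update is below tol."""
    r = r0
    for step in range(1, max_iter + 1):
        r_new = omega * T(r) + (1.0 - omega) * r
        if abs(r_new - r) < tol:
            return r_new, step
        r = r_new
    raise RuntimeError("iteration did not converge")


# Hypothetical scalar stand-ins for the two sub-problems.
def P_op(r):
    """'Force density' produced by a given 'shape'."""
    return 1.0 / (1.0 + r * r)


def R_op(p_e):
    """'Shape' produced by a given 'force density'."""
    return 1.0 + 0.5 * p_e


r_fix, steps = relaxed_fixed_point(lambda r: R_op(P_op(r)), r0=1.0, omega=0.8)
```

With $\omega = 1$ the scheme is the plain fixed-point iteration; choosing $\omega < 1$ damps the update, which is the remedy used above for higher voltages.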

The potentials $\Phi^{(1)}$ and $\Phi^{(2)}$ are presented in Fig. 3. The
difference between them is small, but $\Phi^{(1)}$ is not zero inside the
dielectric droplet. On the other hand, the droplet shapes differ considerably
for larger voltages. Fig. 4 shows typical droplet shapes for both cases.





Fig. 4. Droplet shapes for $U = 0$ kV, $U = 8$ kV, $U = 16$ kV, $U = 24$ kV and
$U = 32$ kV (solid lines, from above), left: dielectric droplet, right:
conductive droplet. The additional dashed line shows the droplet shape for
$U = 40$ kV. Vertical scale $\approx$ 8:1, horizontal scale $\approx$ 16:1, i.e.
two times exaggerated height.

An example of a droplet shape in a very strong electric field, here for
$U = 40$ kV, is given in Fig. 4. The droplet has become lengthened and flattened
so clearly that a local height minimum is visible at $x_1 = 0$. For even higher
voltages the droplet will tear apart. The numerical values of the voltages depend
on the positions of the electrodes. For instance, electrodes at larger distances
from the droplet require higher voltages to produce the same effects.

Fig. 5 gives a closer comparison between the dielectric and the conductive
droplet. The conductive droplet becomes more lengthened and more flattened for
the same voltage. The difference is substantial for strong electric fields.

The shapes of the droplets are identical for $U$ and $-U$. This makes clear why
the droplets oscillate with double the frequency of a slowly alternating voltage.





Fig. 5. Left: Droplet shapes for U = 16 kV and U = 32 kV (dashed lines: dielectric, 
solid lines : conductive droplet). Right : Droplet heights depending on the voltage U. 



The assumptions of a purely dielectric and a purely conductive droplet are 
idealizations. A realistic droplet of rainwater contains a limited amount of free 
electric charge. For low voltages, it behaves like a purely conductive droplet. For 
higher voltages, all free electric charge will be concentrated on the boundary. 
Nevertheless, it does not suffice to assure the absence of an electric field inside the 
droplet for higher voltages. The droplet behaves more and more like a dielectric 
one. 

To handle this situation numerically, a further evolution equation for the flux 
of the free electric charge on the boundary of the droplet has to be solved under 
the constraint condition of a limited amount of free electric charge. Instead of the 
stationary problem (8), a quasi-stationary problem including charge flux and a 
dependence of the conductivity on the available electric charge has to be solved. 



6 Conclusion 

The free boundary value problem of determining the shape of a rainwater droplet
in a stationary electric field was handled by a Banach-like iteration over two
sub-problems. The first sub-problem is an evolution problem to find the droplet
shape for a given outer force density. The second sub-problem consists in finding
the electric potential and thus the force density. It was computed by finite
elements.

We conclude that the proposed numerical procedure works well in the
two-dimensional case without further sophistication. An extension to the
three-dimensional case is currently in progress and requires technical effort on
the basis of the same ideas. Thus, the numerical techniques to calculate the
shapes of droplets on insulating material are presented and tested.

The difference between the shapes of dielectric and conductive droplets is
considerable. A discussion of experimental results should include this material
property.

Abstract. The hydrodynamical properties of the magnetic fluid in magnetic
fluid seals are described by a coupled system of nonlinear partial
differential equations in a three-dimensional domain with free boundaries.
We propose a reduction of the three-dimensional model and describe the
resulting two subproblems, the calculation of new boundaries
for given flow and magnetic data, and the computation of the flow in a
fixed domain, which have to be solved in an iterative manner. We consider
in detail the finite element solving strategy for the flow part, which
constitutes the main effort within the overall algorithm, and show the results
of a numerical test example.

1 Introduction 

Magnetic fluid seals are designed to hermetically separate a high-pressure from a
low-pressure environment. This is realized by bringing a drop of ferrofluid into
the gap between a magnet and a highly permeable rotating shaft. In Fig. 1 such a
rotary shaft seal is shown, where 1 denotes the magnet, 2 the core, 3 the
concentrator, 4 the shaft (rotor), and 5 the magnetic fluid bridge.



In this paper we describe the underlying equations for modelling the behaviour
of the fluid drop in a magnetic fluid seal, present a simplified model
and explain the iterative procedure to determine the free boundaries and the
fluid flow (Sec. 2). The main focus is on the efficient solution of the discretized
equations for the hydrodynamic part, which is the most costly part of the overall
algorithm (Sec. 3 and Sec. 4). Finally, we present numerical results for a test
problem of simulating the flow in a magnetic fluid rotary shaft seal (Sec. 5).

Fig. 1. Schematic representation of a rotary shaft seal

I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 396-403, 2004.

2 Mathematical Modelling of a Rotary Shaft Seal 

2.1 Basic Mathematical Model 

Starting with the Maxwell equations applied to a nonconducting fluid, the
magnetostatic problem is given by

$$\nabla\times\mathbf{H} = 0, \qquad \nabla\cdot\mathbf{B} = 0 \qquad (1)$$

with the magnetic field strength $\mathbf{H}$ and the magnetic induction
$\mathbf{B}$. The dependence of $\mathbf{H}$ and $\mathbf{B}$ is described by
$\mathbf{B} = \mu_0(\mathbf{H} + \mathbf{M}(\mathbf{H}))$, where $\mathbf{M}$
denotes the magnetization and $\mu_0$ is the permeability constant. Assuming a
hyperbolic shape of the concentrator and strong magnetic fields, the magnetic
fluid will be in saturation and an explicit analytical expression for the
magnetic field is available (see [6]).

The hydrodynamical properties of the magnetic fluid are described by the
incompressible three-dimensional Navier-Stokes equations in a given domain
$\Omega \subset \mathbb{R}^3$ (see [7]):

$$-\eta\,\Delta\mathbf{v} + \varrho\,\mathbf{v}\cdot\nabla\mathbf{v} + \nabla p_c = \mathbf{f} + \mu_0 M \nabla H \quad\text{in } \Omega, \qquad (2)$$

$$\nabla\cdot\mathbf{v} = 0 \quad\text{in } \Omega. \qquad (3)$$

Here, $\eta$ denotes the dynamic viscosity, $\varrho$ the fluid density,
$\mathbf{v}$ the velocity, $p_c = p_f + p_m$ the composite pressure with the
thermodynamic pressure in the fluid $p_f$ and the fluid-magnetic pressure
$p_m = \mu_0 \int_0^H M(H')\,dH'$, $\mathbf{f}$ the volume force,
$M = |\mathbf{M}|$ and $H = |\mathbf{H}|$.

The boundary $\partial\Omega$ of the fluid domain consists of two free surfaces
$\Gamma_F$ and the remaining solid part $\partial\Omega\setminus\Gamma_F$. On the
free surfaces of the magnetic fluid the Young-Laplace equation

$$\alpha\,\mathcal{H} = \frac{\mu_0}{2}(\mathbf{M}\cdot\mathbf{n})^2 - [\mathbf{n}\cdot\sigma_{ij}(\mathbf{v},p_c)\,\mathbf{n}] \quad\text{on } \Gamma_F \qquad (4)$$

holds, where $\mathcal{H}$ denotes the mean curvature, $\alpha$ is the surface
tension, $\mathbf{n}$ is the unit normal vector and $\sigma_{ij}(\mathbf{v},p_c)$,
$i,j = 1,2,3$, is the hydrodynamic part of the stress tensor. Representing the
free surface as an unknown graph, equation (4) can be rewritten as a nonlinear,
nonuniformly elliptic equation. Under the above assumptions the magnetic fluid is
in saturation, $M \approx M_S$, which gives $p_m = \mu_0 M_S H$. Neglecting terms
of smaller size in (4) we obtain

$$\mu_0 M_S H(x,y,z(x,y)) = -[\mathbf{n}\cdot\sigma_{ij}(\mathbf{v},p_f)\,\mathbf{n}], \qquad (5)$$

that is, an algebraic equation for the unknown free surface $z = z(x,y)$.
Then, the following iteration has been used to determine approximately the free
surfaces, the velocities and the pressure: For a given domain $\Omega^n$ we solve
the Navier-Stokes problem to find the velocity $\mathbf{v}$ and the pressure
$p_f$ up to an additive constant. Indeed this is possible since the right hand
side of (2) is just $\mathbf{f} + \nabla p_m$ and $p_c = p_f + p_m$. Next we
determine the two free surfaces $z_i = z_i(x,y)$, $i = 1,2$, as solutions of (5)
and fix the constant by requiring a given fluid volume. In this way we obtain a
new domain $\Omega^{n+1}$. This procedure is repeated as long as the new domain
differs from the old one by more than a given tolerance. The solution of the
Navier-Stokes equations

$$-\eta\,\Delta\mathbf{v} + \varrho\,\mathbf{v}\cdot\nabla\mathbf{v} + \nabla p_f = \mathbf{f} \quad\text{in } \Omega^n, \qquad (6)$$

$$\nabla\cdot\mathbf{v} = 0 \quad\text{in } \Omega^n \qquad (7)$$

is the most expensive part of the overall algorithm.



2.2 Simplified Flow Model 

Under stationary operating conditions the fluid domain $\Omega$ can be considered
to be rotationally symmetric. Thus, (6)-(7) can be rewritten using cylindrical
coordinates. Since the minimum gap width $a$ is small compared to the radius $R$
of the rotary shaft, a reduced two-dimensional model can be derived. Introducing
dimensionless variables we end up with a problem for the azimuthal velocity $w$,
the secondary flow $u = (u_1, u_2)$, and the pressure $p$ in the cross-section
$\Omega \subset \mathbb{R}^2$:

— -I- u • Vrc = 0 in 17 , (8) 

Re 

— — Z\u -I- u • Vu -I- Vp = ejf in 17 , (9) 

Re 

V • u = 0 in 17 . (10) 

Here, we have used the notations Re for the Reynolds number,
$\delta = a/R \ll 1$ for the relative width of the gap, and $e_x$ for the unit
vector in the $x$-direction.

The boundary $\partial\Omega = \Gamma_F \cup \Gamma_C \cup \Gamma_S$ consists of
two free surfaces $\Gamma_F$, the boundary of the concentrator $\Gamma_C$ and of
the shaft $\Gamma_S$. The following boundary conditions are applied:

$$\begin{aligned} u &= 0 && \text{on } \Gamma_C \cup \Gamma_S, && (11) \\ u\cdot n &= 0 && \text{on } \Gamma_F, && (12) \\ n_i\,\sigma_{ij}\,\tau_j &= 0 && \text{on } \Gamma_F,\ i,j = 1,2, && (13) \\ w &= 0 && \text{on } \Gamma_C, && (14) \\ w &= 1 && \text{on } \Gamma_S, && (15) \\ \frac{\partial w}{\partial n} &= 0 && \text{on } \Gamma_F. && (16) \end{aligned}$$

Here, $n$ and $\tau$ are the unit normal and tangential vector, respectively, and
$\sigma_{ij}$ denotes the hydrodynamic part of the stress tensor. Consequently, a
slip boundary condition (12) and a condition on the tangential stresses (13) are
imposed on the free boundary $\Gamma_F$, whereas a Dirichlet (or no-slip)
boundary condition (11) is prescribed for $u$ on the remaining part
$\partial\Omega \setminus \Gamma_F = \Gamma_C \cup \Gamma_S$. In contrast to the
Navier-Stokes equations with Dirichlet boundary conditions, there seems to be
rather little work concerned with numerical methods for Navier-Stokes equations
with slip boundary conditions.

The derivation of weak formulations of the model (8)-(16) can be found in
[5], where also the solvability of the problem is analysed. Both existence and
uniqueness (for small $\delta\,\mathrm{Re}^2$) of the weak solution follow from
the general theory of saddle point problems, the standard Galerkin method and a
weak maximum principle.

3 Decoupling Solving Strategy 

The structure of the system of nonlinear partial differential equations (8)-(10)
and the smallness of $\delta \ll 1$ suggest the following decoupling strategy:

S1. Start with the initial guess for the secondary flow $u = 0$.

S2. Solve the convection-diffusion equation (8) for the given velocity $u$.

S3. Calculate a new velocity $u$ and a pressure $p$ from the Navier-Stokes
equations (9)-(10) using the azimuthal velocity $w$ obtained in S2.

S4. If the discrete $L^2$-norm of the difference between two subsequent solutions
of the convection-diffusion equation is smaller than a given tolerance then
STOP, otherwise continue with S2.
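The control flow of steps S1 to S4 can be sketched as follows. The two sub-solvers are passed in as functions; in this minimal example they are replaced by made-up scalar formulas with a small coupling parameter, so the code shows only the alternation and the stopping test on $w$. All names and values are hypothetical.

```python
def decoupled_iteration(solve_w, solve_u, u0, tol=1e-10, max_iter=100):
    """S1-S4: alternate between the two sub-solvers until w stops changing."""
    u = u0                      # S1: initial guess for the secondary flow
    w_old = None
    for it in range(1, max_iter + 1):
        w = solve_w(u)          # S2: azimuthal velocity for the given u
        u = solve_u(w)          # S3: secondary flow for the given w
        if w_old is not None and abs(w - w_old) < tol:   # S4: stopping test
            return w, u, it
        w_old = w
    raise RuntimeError("decoupled iteration did not converge")


DELTA = 0.05   # hypothetical small coupling parameter


def toy_solve_w(u):
    """Stand-in for the convection-diffusion solve."""
    return 1.0 / (1.0 + u * u)


def toy_solve_u(w):
    """Stand-in for the Navier-Stokes solve."""
    return DELTA * w * w


w, u, iterations = decoupled_iteration(toy_solve_w, toy_solve_u, u0=0.0)
```

The weaker the coupling, the fewer sweeps are needed, which mirrors the role of the small parameter $\delta$ in the strategy above.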

Our iterative decoupling strategy results in a separate solution of the
convection-diffusion part (8) and the Navier-Stokes part (9)-(10). It
substantially generalizes the decoupling presented in [6], where, taking into
consideration that the magnitude of the azimuthal velocity $w$ is much larger
than that of the secondary flow $u$, the term $u\cdot\nabla w \approx 0$ in (8)
is neglected. Thus, given the azimuthal velocity we can determine the secondary
flow by solving the two-dimensional problem (9)-(10). Indeed, the decoupling in
[6] is the first step of our decoupling strategy (S1)-(S4).

4 Finite Element Discretization 

We discretize (8)-(16) by using the conforming $P_2/P_1$ finite element pair, i.e.,
the velocity $\mathbf{u}$ and the azimuthal velocity $w$ are approximated by continuous,
piecewise quadratic functions while the pressure $p$ is approximated by continuous,
piecewise linear functions. The conforming finite element pair $P_2/P_1$ satisfies
the Babuška-Brezzi stability condition [2], which guarantees a stable discretization
of the saddle point problem (9)-(10). We consider a discretisation where the
slip boundary condition (12) is incorporated in the ansatz and the finite element
space. It is also possible to treat the slip boundary condition by means of a Lagrange
multiplier, see e.g. [1]. Numerically, however, our treatment of the slip boundary
condition is less complicated and cheaper.

The discretisation of (8)-(10) corresponds to a linear algebraic system of
equations for (8) and a nonlinear algebraic system for the saddle-point problem
(9)-(10). In contrast to the numerical treatment in [1], where the slip boundary
condition is considered as an additionally defined projection, we incorporate
the condition directly in the nonlinear algebraic system corresponding to the
problem (9)-(10) with boundary conditions (11)-(13). The nonlinearity within
the problem (9)-(10) is resolved by using a fixed-point iteration (linearisation of
the convective term). Thus, it remains to solve efficiently a large linear system
in each nonlinear iteration step. We use the geometric multigrid method as an
effective solver. For a detailed discussion of the design of multigrid methods for
indefinite saddle point problems we refer to [3], where a convergence proof in
the case of the Stokes problem has been given.

Within the multigrid method for solving the saddle-point problem (9)-(10)
we use a W-cycle with three pre-smoothing and three post-smoothing steps. As a
smoother, we use the block Gauss-Seidel method (Vanka-type smoother, see [8]).
The linear algebraic system corresponding to the discretisation of (8) is also
solved by applying a geometric multilevel approach. As a smoother we consider the
ILU method and use a W-cycle with two pre-smoothing and two post-smoothing
steps.

5 Numerical Results 

In this section, we present numerical results on the behaviour
of the decoupling solving strategy (S1)-(S4) in a fixed domain. To model a
realistic situation, geometric data of the free boundary from [6] have been used. All
presented numerical results have been obtained with the program MooNMD [4].
The computations were performed on an HP workstation J5600.

5.1 The Test Problem 

We consider a rotary shaft seal with a minimum gap width a = 1 as a test
example. Figure 2 shows schematically the fluid domain $\Omega$ and also the triangulation
at refinement level 2. All finer meshes are constructed by uniform refinement
and boundary adaptation.

The calculations correspond to a Reynolds number of Re = 63.2456 and a
dimensionless gap width $\delta = 0.01$.

Table 1 gives the number of cells and degrees of freedom for the $P_2/P_1$ pair
on the different refinement levels. Since we consider a two-dimensional problem,
the number of unknowns increases by a factor of about 4 from one level to the
next finer level.
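
The growth factor quoted above can be checked directly from the entries of Table 1. The short sketch below only reuses the tabulated numbers; the function name `growth_factors` is introduced here for illustration.

```python
# Entries of Table 1 for the refinement levels 2, 3, 4, 5, 6.
CELLS = [32, 128, 512, 2048, 8192]
DOFS = {
    "azimuthal velocity": [85, 297, 1105, 4257, 16705],
    "velocity": [170, 594, 2210, 8514, 33410],
    "pressure": [27, 85, 297, 1105, 4257],
}


def growth_factors(counts):
    """Ratio of each entry to the entry on the next coarser level."""
    return [fine / coarse for coarse, fine in zip(counts, counts[1:])]


if __name__ == "__main__":
    print("cells", growth_factors(CELLS))
    for name, counts in DOFS.items():
        print(name, [round(f, 3) for f in growth_factors(counts)])
```

The number of cells grows by exactly 4 per level, while the numbers of unknowns grow by slightly less than 4 and approach 4 on the finer levels, since nodes shared between cells are counted once.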
Fig. 2. Geometry of the fluid domain $\Omega$ (left) and the triangulation at level 2 (right)

Table 1. Number of cells and degrees of freedom at each multigrid level

Level                   2      3      4      5      6
Number of cells        32    128    512   2048   8192
Azimuthal velocity     85    297   1105   4257  16705
Velocity              170    594   2210   8514  33410
Pressure               27     85    297   1105   4257



Table 2. Linear iterations in decoupling step S2, nonlinear iterations (total number
of linear iterations) in decoupling step S3, and computing times in seconds per decoupling
iteration on the multigrid levels 4, 5 and 6

Level   Decoupling   Linear             Nonlinear (linear)   CPU time
        iteration    iterations in S2   iterations in S3     in sec.

4       1            5                  4 (7)                  5.65
        2            3                  2 (11)                 8.84
        3            2                  1 (13)                10.44
        4            1                  1 (13)                10.43

5       1            5                  3 (6)                 21.32
        2            3                  1 (9)                 31.79
        3            2                  1 (10)                35.30
        4            1                  1 (10)                35.26

6       1            5                  3 (5)                 73.95
        2            2                  1 (7)                103.84
        3            1                  1 (7)                103.67
        4            1                  1 (7)                103.50



5.2 The Behaviour of the Decoupling Solving Strategy 

The numbers of linear iterations for solving the problem (8) in S2, as well as the
numbers of nonlinear iterations and the total numbers of linear iterations within the
decoupling step S3, per decoupling iteration on different multigrid levels are
presented in Tab. 2. The decoupling iteration on each multigrid level is stopped
if the discrete ^ 2 -norm ||w® — is less than 10“®, where rc* and are 

two successive solutions in S2 on a given multigrid level. The nonlinear fixed-
point iteration in decoupling step S3 is stopped if the Euclidean norm of the
residual is less than $10^{-10}$. The linear systems within the nonlinear iteration
and the linear system in S2 corresponding to the discretization of the problem
(8) are solved approximately until the residual is reduced by the factor 0.1 or
the residual is less than $10^{-10}$. Table 2 also gives the computing times in seconds
per decoupling iteration on the multigrid levels 4, 5 and 6. The presented times
show the efficiency of our decoupling solving strategy.

Table 3. Discrete $l^2$-norm of the difference between the solution after one decoupling
iteration and that after the last iteration on the finest multigrid levels

Level   $\|w^1 - w\|_{l^2}$   $\|\mathbf{u}^1 - \mathbf{u}\|_{l^2}$   $\|p^1 - p\|_{l^2}$

4       2.997817e-05          2.356857e-07          1.075455e-07
5       8.629506e-06          1.813309e-07          2.923259e-08
6       3.342695e-06          5.438233e-08          1.242328e-08



Table 3 shows the discrete $l^2$-norms of the difference between the solution
after one decoupling iteration and that after the last iteration on the
different multigrid levels. Our results verify the simplified solving procedure
used in [6].

The computed azimuthal velocity, pressure and stream function on level
6 after two decoupling iterations are visualized in Fig. 3.




Fig. 3. Isolines of the computed azimuthal velocity (left), pressure (center) and stream
function (right) on level 6 after two decoupling iterations



Acknowledgements 

For supporting the research presented in this paper, the authors would like to thank
the German Research Foundation (DFG), grant TO 143/4 within the high
priority research program SPP 1104. Several pictures were produced using the
visualisation package GRAPE.



Numerical Simulation of the Flow in Magnetic Fluid Rotary Shaft Seals 403 



References 

1. Bänsch, E., Höhn, B.: Numerical treatment of the Navier-Stokes equations with
slip boundary condition. SIAM J. Sci. Comp., 20 (2000) 2144-2162

2. Girault, V., Raviart, P.: Finite element methods for Navier-Stokes equations.
Theory and algorithms. Springer-Verlag, Berlin, 1986

3. John, V., Knobloch, P., Matthies, G., Tobiska, L.: Non-nested multi-level solvers
for finite element discretizations of mixed problems. Computing, 68 (2002) 343-373

4. John, V., Matthies, G.: MooNMD - a program package based on mapped finite
element methods. Technical Report 02-01, Fakultät für Mathematik,
Otto-von-Guericke-Universität Magdeburg (2002)

5. Mitkova, T.: Lösbarkeit eines mathematischen Modells für Dichtungen mit
magnetischen Flüssigkeiten. Technische Mechanik, 20 (2000) 283-293

6. Polevikov, V., Tobiska, L.: Modeling of a dynamic magneto-fluid seal in the
presence of a pressure drop. Fluid Dynamics, 36 (2001) 890-898

7. Rosensweig, R. E.: Ferrohydrodynamics. Dover Publications, Inc., Mineola, New
York, 1997

8. Vanka, S.: Block-implicit multigrid calculation of two-dimensional recirculating
flows. Comp. Meth. Appl. Mech. Eng., 59 (1986) 29-48



Phase-Field Method for 2D Dendritic Growth 



Vladimir Slavov^1, Stefka Dimova^1, and Oleg Iliev^2

^1 Sofia University, Faculty of Mathematics and Informatics
5 James Bourchier Blvd., 1164 Sofia, Bulgaria
vladimir-slavov@yahoo.com, dimova@fmi.uni-sofia.bg
^2 ITWM, FHG,

Europaallee 10, Room 1-C12, D-67657 Kaiserslautern, Germany
iliev@itwm.fhg.de



Abstract. The phase field method is used to model the free dendritic 
growth into undercooled melt in 2D. The pair of two nonlinear reaction- 
diffusion equations — for the temperature and for the phase-field func- 
tion — are solved numerically by using the finite element method in 
space. A second order modification of the Runge-Kutta method with ex- 
tended region of stability and automatic choice of the time step is used 
to solve the resulting system of ordinary differential equations in time.
Numerical experiments are made to analyze the evolution in time of a 
spherical seed in the isotropic case and for different kinds of anisotropy. 
The results for the dimensionless dendritic tip velocity, found in the dif- 
ferent cases, are compared with the results of the microscopic solvability 
theory and with numerical results found by the finite difference method. 



1 Introduction 

Free dendritic growth is a frequently observed growth mode in casting and weld- 
ing processes. The understanding and the control of its structures are of great 
importance to metallurgy, because the microstructure that forms determines the
quality of the solidified material. As a highly nonlinear phase transformation pro- 
cess, the dendritic growth is interesting from mathematical and computational 
viewpoints as well. 

The solidification of a pure material belongs to the class of first order phase 
transitions where the energy of self-organization is liberated during the process. 
The Stefan problem for melting or freezing is obtained by evaluating the heat 
balance of the system and models the process only on a macroscale. Formation 
of microstructure in the system is a result of changes in another kind of energy, 
linked to the structural organization (free energy or entropy). The heat and the 
free energy interact during the solidification process and can lead to development 
of unstable complex shapes in the solid subdomain (cells, dendrites). 

The dendritic growth is characterized by propagation of a primary tip and 
nonlinear evolution of secondary and tertiary side-branches on a microscopic 
scale (from 10“® to 10“^m). Equiaxed grains result if the growth is from free 
nuclei into a supercooled melt. An assemblage of many equiaxed dendritic grains
constitutes the so-called mushy zone on a macroscopic scale (10“^m). The mod- 
eling of equiaxed dendritic solidification on the scale of a typical mushy zone thus 
requires the simultaneous consideration of growth and transport phenomena over 
five orders of magnitude. Between the microscopic and macroscopic length scales, 
a mesoscopic scale is defined which is of order from 10“^ to 10“^m. A mesoscopic 
unit cell inside a mushy zone typically contains about 10 equiaxed grains and 
thousands of dendrite arms and tips. Such unit cells or representative elementary 
volumes form the basic building blocks of macroscopic or volume-averaged solid- 
ification models used in the simulation of casting processes. Accurate modeling 
of the dendritic growth processes is not only important for the prediction of the 
final grain structure of a solidified material, but also for feeding local information 
back to the macroscopic model. 

The aim of our work is to develop a numerical method and algorithm for solving
the phase field formulation of the dendritic growth problem in 2D and to
show its reliability by comparing the numerical results with current theories of
dendrite tip selection, such as the microscopic solvability theory, and with some known
numerical results. Our further aim is to apply the developed technique to
investigate the dendritic growth in some real metal alloys which are used in
metallurgy but whose structure is almost unknown.



2 Modified Stefan Problem and Phase-Field Model 

In the phase transition theory, two streams can be distinguished. 

The Gibbs approach considers a sharp phase interface where a discontinuity
of some variables is assumed. One of the consequences is the classical Stefan
problem where the interface is defined by the melting temperature $T_m$. To take
into account the surface effects (surface tension, undercooling), the classical
Stefan problem has been modified. In the case of a pure solidifying substance, the
Gibbs-Thomson relation has been included, thus giving the following problem
(the material parameters of the liquid and solid phases are assumed equal and
constant):



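$\frac{\partial u}{\partial t} = D \nabla^2 u$ , (1)

$v_n = D \left( \partial_n u|^{+} - \partial_n u|^{-} \right)$ , (2)

$u_i = -d(\mathbf{n})\,\kappa - \beta(\mathbf{n})\,v_n$ , (3)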



where:

— $u = (T - T_m)/(L/c_p)$ is the dimensionless temperature;

— $D$ is the thermal diffusivity;

— $L$ is the latent heat of fusion;

— $c_p$ is the specific heat at constant pressure;

— $\mathbf{n}$ is the outer normal to the solid subdomain;

— $v_n$ is the normal velocity of the interface;

— $\partial_n u|^{\pm}$ are the normal derivatives of the temperature at the interface
for the solid (+) and liquid (-) phases;

— $\kappa$ is the local curvature;

— $d(\mathbf{n}) = \gamma(\mathbf{n}) T_m c_p / L^2$ is the capillary length;

— $\gamma(\mathbf{n})$ is the surface tension;

— $\beta(\mathbf{n})$ is the kinetic coefficient.

The van der Waals approach assumes a non-sharp but thin transition layer
between phases where all the thermodynamical parameters vary continuously.
The phase-field method for numerical simulation of phase transition problems is
based on this approach. The thin transition layer is considered as a diffuse region
with small but numerically resolvable thickness. A variable $\phi$, called the phase-field
variable, is introduced. It varies smoothly from -1 in the liquid to +1 in the solid
phase. This approach is attractive because the explicit tracking of the interface
and accurate computation of interface normals and curvatures are completely
avoided. An evolution equation for the phase-field variable is solved instead and
the solid-liquid interface is defined by the level set $\phi = 0$. In the computations,
it is important to understand how the quality of the solution depends on the
interface thickness, because the grid spacing needs to be of the order of or smaller
than the interface thickness.

It must be noted that there are many ways to prescribe a smoothing and
dynamics of the sharp interface model, so that there is no unique phase-field
model. We construct the numerical method on the basis of the phase-field model
used in [1-3].

In the isotropic case ($d(\mathbf{n}) = d_0 = \mathrm{const}$, $\beta(\mathbf{n}) = \beta_0 = \mathrm{const}$) the dimensionless
form of the phase-field equation is:

= + + + . (4) 

Here the length is measured in units of $\delta$ (see (6) and the explanations below),

r2 Q c 

the time - in units of A = is a dimensionless parameter. 

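In the anisotropic case the dimensionless phase-field equation is:

$\tau \frac{\partial \phi}{\partial t} = \nabla \cdot (W^2 \nabla \phi) + \partial_x \left[ |\nabla\phi|^2 W \frac{\partial W}{\partial(\partial_x \phi)} \right] + \partial_y \left[ |\nabla\phi|^2 W \frac{\partial W}{\partial(\partial_y \phi)} \right] - \partial_\phi F(\phi, \lambda u)$ , (5)

where:

— $F(\phi, \lambda u) = f(\phi) + \lambda u\, g(\phi)$ is the free energy;

— $W = \delta A_s$, $A_s = (1 - 3\varepsilon) + 4\varepsilon \frac{(\partial_x\phi)^4 + (\partial_y\phi)^4}{|\nabla\phi|^4}$, and $\varepsilon$ is the anisotropy strength;

— $\tau = \tau_0 A_s$ and $\tau_0$ is the characteristic time.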

The stationary solution of both equations is: 
where $n$ is the coordinate normal to the interface and $3\sqrt{2}\,\delta$ is the interface
thickness over which $\phi$ varies from -0.9 to 0.9. We use this stationary solution
to validate our computations. In both cases the equation for the temperature is:

$\frac{\partial u}{\partial t} = D \nabla^2 u + \frac{1}{2} \frac{\partial \phi}{\partial t}$ . (7)

So the pairs (4), (7) and (5), (7) constitute the phase-field model.



3 Numerical Method 



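We consider the problem for a single spherical seed $K^0 = \{(x,y) : x^2 + y^2 < r^2\}$,
placed at time $t = 0$ in the center of the square $\Omega^0 = [-l, l] \times [-l, l]$. Because
of the symmetry, the computational domain is
$\Omega = \{0 < x < l,\ 0 < y < l\}$, $K = \Omega \cap K^0$.

So we solve the problems: find $u = u(x,y,t)$ and $\phi = \phi(x,y,t)$, $(x,y) \in \Omega$,
$0 < t < T$, which satisfy:

$\frac{\partial u}{\partial t} = D \Delta u + \frac{1}{2} \frac{\partial \phi}{\partial t}$ , (8)

$\frac{\partial \phi}{\partial t} = \Delta\phi + \phi(1 - \phi^2) - \lambda u (1 - \phi^2)^2$ (for the isotropic case) , (9)

$\tau \frac{\partial \phi}{\partial t} = \nabla \cdot (W^2 \nabla\phi) + \partial_x \left[ |\nabla\phi|^2 W \frac{\partial W}{\partial(\partial_x\phi)} \right] + \partial_y \left[ |\nabla\phi|^2 W \frac{\partial W}{\partial(\partial_y\phi)} \right] + \phi(1 - \phi^2) - \lambda u (1 - \phi^2)^2$ , (10)

the initial conditions:

$u(x,y,0) = 0$, $\forall (x,y) \in K$,
$u(x,y,0) = u_0$, $\forall (x,y) \in \Omega \setminus K$ , (11)

$\phi(x,y,0) = 1$, $\forall (x,y) \in K$,
$\phi(x,y,0) = -1$, $\forall (x,y) \in \Omega \setminus K$ , (12)

and the boundary conditions:

$\frac{\partial u}{\partial n} = 0$, $\frac{\partial \phi}{\partial n} = 0$, $\forall t > 0$, $\forall (x,y) \in \partial\Omega$ . (13)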



To solve the systems (8), (9), (11)-(13) and (8), (10), (11)-(13) we use the Galerkin
finite element method. For the semidiscretization in space Lagrangian finite elements
are used, and then a system of ODEs is solved in time.



3.1 Semidiscretization in Space 

The weak form of the problem (8), (10) is: find $u(x,y,t)$ and $\phi(x,y,t)$ that satisfy
the integral identities



fdu f dudv dudv 1 



d4> 

Jt 



V dx dy 



To 



,d4) 



d4> dv d4> dv , 



A—vdxdy=- / W (^^ + ^^)dxdy- 



lo 



■dt 



dx dx dy dy 
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I" ^s4>x(l>y{(l>l - 4>l) ^ d4> dv d(j)dv 



la (</>^ + ^dydx dx 

— (j)^) — Au(l — (j)^Y)vdxdy , Vt> S . 

The corresponding semidiscrete problem is 



/ 

Ja 



+ (14) 

+ MF^^\<P) + MF(2)(<^, U{t)) , (15) 

U{Q) = U^^\ <Z>(0) = . (16) 

In the isotropic case the equation for $\Phi$ reads:

$M \frac{d\Phi}{dt} = -K \Phi(t) + M F^{(1)}(\Phi) + M F^{(2)}(\Phi, U(t))$ . (17)

For numerical integration we use Lobatto quadrature formulas. This gives a 
diagonal constant mass-matrix M. The matrix (F) is symmetric, (F) is 
antisymmetric. The matrix M(F) is not a diagonal one, but in the computations 
we use the lumped matrix M{F). 
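
The effect of Lobatto quadrature on the mass matrix can be illustrated on a single element. The sketch below assumes a square bilinear element of side `h` and the two-point Lobatto rule in each direction (quadrature points at the element vertices); the function names are introduced here for illustration only.

```python
def bilinear_shape_functions(xi, eta):
    """Bilinear shape functions on the reference square [0, 1] x [0, 1]."""
    return [(1 - xi) * (1 - eta), xi * (1 - eta), xi * eta, (1 - xi) * eta]


def lobatto_element_mass_matrix(h):
    """Element mass matrix of an h x h bilinear element, two-point Lobatto rule.

    The 1D rule has points 0 and 1 with weights 1/2 each, so the tensor-product
    rule samples the integrand only at the four vertices of the element.
    """
    points, weights = [0.0, 1.0], [0.5, 0.5]
    mass = [[0.0] * 4 for _ in range(4)]
    for xi, wx in zip(points, weights):
        for eta, wy in zip(points, weights):
            shape = bilinear_shape_functions(xi, eta)
            weight = wx * wy * h * h      # includes the Jacobian h^2
            for i in range(4):
                for j in range(4):
                    mass[i][j] += weight * shape[i] * shape[j]
    return mass


if __name__ == "__main__":
    for row in lobatto_element_mass_matrix(0.4):
        print(row)
```

Each shape function equals one at its own vertex and zero at the others, so all off-diagonal products vanish at the quadrature points and every diagonal entry equals $h^2/4$. With a diagonal mass matrix no linear system has to be solved in an explicit time step.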

To solve the ODE systems (14)-(16) and (14), (17), (16) we use a second 
order explicit modification of the Runge-Kutta method [5] with extended region 
of stability. The time step is chosen automatically so as to guarantee stability 
and a given desired accuracy at the end of the time interval. 



4 Numerical Experiments 

In the experiments below a uniform partition into bilinear finite elements of size 
h is used. 

In the first set of experiments we examine the evolution in time of the phase- 
field function and compare its behavior with the stationary solution (6). This 
experiment is important for validation of our computations. 

We solve Eq. (9) in the domain $\Omega = \{(x,y) : 0 < x < 80,\ 0 < y < 80\}$
under the initial condition (12), where $r = 8$. The values of the other parameters
are: $\lambda = 1$, $u = 1$, $\delta = 1$, $h = 0.4$, so during the steady-state stage of the process
11-12 mesh points must belong to the interface thickness $3\sqrt{2}\,\delta$ over which $\phi$
varies from -0.9 to 0.9. Our computations confirm this. Fig. 1 shows the profiles
of the stationary solution (6) and the profiles of $\phi$ for different times. It can be
seen that for $t = 1$ the profile of $\phi$ is the same as the stationary one and it is
well preserved further.
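
The count of 11-12 mesh points can be reproduced by a short calculation. The sketch assumes that the stationary profile has the hyperbolic tangent form $\tanh(n/(\sqrt{2}\,\delta))$, which is the usual stationary solution for this kind of double-well phase-field model; the function names are illustrative.

```python
import math


def interface_width(delta, level=0.9):
    """Distance over which tanh(n / (sqrt(2) * delta)) goes from -level to +level."""
    return 2.0 * math.sqrt(2.0) * delta * math.atanh(level)


def mesh_points_across(width, h):
    """Number of mesh points (intervals plus one) spanning a layer of given width."""
    return width / h + 1.0


if __name__ == "__main__":
    delta, h = 1.0, 0.4
    exact = interface_width(delta)          # about 4.16 for delta = 1
    nominal = 3.0 * math.sqrt(2.0) * delta  # about 4.24, the value quoted in the text
    print(exact, nominal)
    print(mesh_points_across(exact, h), mesh_points_across(nominal, h))
```

Under this assumption the width $2\sqrt{2}\,\delta\,\mathrm{artanh}(0.9) \approx 2.94\sqrt{2}\,\delta$ is rounded to $3\sqrt{2}\,\delta$ in the text, and with $h = 0.4$ both values give between 11 and 12 mesh points.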

The second set of experiments is devoted to investigating the evolution in time
of a single spherical grain into a supercooled melt.

The experiments in the isotropic case (Eq.(8), (9), (11)-(13)) show the ex- 
cellent preservation of the spherical form of the grain up to times, when the 
boundary conditions (13) begin to act. 
Fig. 1. Profiles of $\phi$



In the anisotropic case (Eq. (8), (10), (11)-(13)) the formation of primary
dendritic tips is investigated. Physical experiments [4] show that after
some time the dendritic tip velocity reaches a stationary state. This velocity is
an important characteristic of the process, so we investigate it as well.

The computational domain is $\Omega = \{(x,y) : 0 < x < 240,\ 0 < y < 240\}$, and
the radius of the seed is $r = 4$. To choose the values of the other parameters
we use the asymptotic analysis reported in [2], which connects the phase-field
model (5), (7) and the modified Stefan problem (1)-(3). In terms of $A_s$,
$d(\mathbf{n}) = d_0 (A_s + \partial_\theta^2 A_s)$ and $\beta(\mathbf{n}) = \beta_0 A_s$, where $\theta$ is the angle between $\mathbf{n}$ and the x-axis;
noting that $\tan\theta = \partial_y\phi / \partial_x\phi$, these expressions become $\beta(\mathbf{n}) = \beta_0 (1 + \varepsilon \cos 4\theta)$ and
$d(\mathbf{n}) = d_0 (1 - 15\varepsilon \cos 4\theta)$. The parameters of the phase-field model are related
to those of the modified Stefan problem by $\lambda = a_1 \delta / d_0$, $\beta_0 = a_1 \left( \frac{\tau_0}{\lambda\delta} - a_2 \frac{\delta}{D} \right)$, where
$a_1 = 0.8839$, $a_2 = 0.6267$. In the computations $\beta_0 = 0$, as in [6]. These
parameters correspond to the metal-model material succinonitrile (SCN), for
which the Isothermal Dendritic Growth Experiment [4] (launched on the space
shuttle by NASA in March 1994) and some computations [6] have been made.
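
As a worked example of how such relations fix the model parameters, the sketch below assumes the standard thin-interface form $\lambda = a_1 \delta / d_0$ and, for vanishing kinetic coefficient $\beta_0 = 0$, $\tau_0 = a_2 \lambda \delta^2 / D$. Both formulas and the function names are assumptions made for illustration; only the constants $a_1$, $a_2$ and the values of $d_0$ and $D$ are taken from the text and the table of results.

```python
A1 = 0.8839  # constant a_1 quoted in the text
A2 = 0.6267  # constant a_2 quoted in the text


def coupling_parameter(d0, delta=1.0):
    """lambda = a_1 * delta / d_0 (assumed thin-interface relation)."""
    return A1 * delta / d0


def characteristic_time(d0, diffusivity, delta=1.0):
    """tau_0 = a_2 * lambda * delta^2 / D, the value that makes beta_0 = 0 (assumed)."""
    return A2 * coupling_parameter(d0, delta) * delta ** 2 / diffusivity


if __name__ == "__main__":
    for d0, diffusivity in [(0.139, 4.0), (0.185, 3.0)]:
        print(d0, diffusivity,
              round(coupling_parameter(d0), 4),
              round(characteristic_time(d0, diffusivity), 4))
```

Under these assumptions both parameter pairs used in the experiments, ($d_0 = 0.139$, $D = 4$) and ($d_0 = 0.185$, $D = 3$), give a characteristic time very close to 1 in units of $\delta = 1$.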

Fig. 2-5 show the evolution of one and the same spherical seed and its
corresponding dimensionless tip velocities for different values of the anisotropy $\varepsilon$,
the undercooling $u_0$, the thermal diffusivity $D$ and the parameter $d_0$. Again
$\delta = 1$, $h = 0.4$. All our computations show that in the steady-state stage of the
evolution the same number of 11-12 mesh points lie within the interface thickness.

In the table below the values of the dimensionless tip velocity found in our
computations (column SDI) are compared with those obtained numerically in [6]
by using the finite difference method (column B), and with those predicted by the
microscopic solvability theory (column MST).
d_0      D    ε       u_0      SDI      B        MST

0.139    4    0.05    -0.55    0.0174   0.0171   0.017
0.185    3    0.05    -0.55    0.0178   0.0175   0.017
0.139    4    0.03    -0.55    0.0110   0.0112   0.0111
0.185    3    0.03    -0.55    0.0124   0.012    0.0111



The experiments made with values of $\delta$ in the interval $0.5 < \delta < 1$ (6-12 points
in the interface thickness) give results which are in the same good agreement
with the other results in the table.
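
The agreement can be quantified with a few lines of code. The sketch below only reuses the tabulated tip velocities; the helper `relative_deviation` and the layout of `RESULTS` are introduced here for illustration.

```python
# Rows of the table: (d_0, D, epsilon, u_0, SDI, B, MST).
RESULTS = [
    (0.139, 4, 0.05, -0.55, 0.0174, 0.0171, 0.017),
    (0.185, 3, 0.05, -0.55, 0.0178, 0.0175, 0.017),
    (0.139, 4, 0.03, -0.55, 0.0110, 0.0112, 0.0111),
    (0.185, 3, 0.03, -0.55, 0.0124, 0.012, 0.0111),
]


def relative_deviation(value, reference):
    """Signed relative deviation of value from reference."""
    return (value - reference) / reference


if __name__ == "__main__":
    for d0, diffusivity, eps, u0, sdi, b, mst in RESULTS:
        print(f"d0={d0} D={diffusivity} eps={eps}: "
              f"vs B {100 * relative_deviation(sdi, b):+.1f}%, "
              f"vs MST {100 * relative_deviation(sdi, mst):+.1f}%")
```

For these four parameter sets the finite element results stay within a few percent of the finite difference results of [6]; the largest deviation from the solvability theory occurs for $\varepsilon = 0.03$, $d_0 = 0.185$.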
Fig. 2. $\varepsilon = 0.05$, $u_0 = -0.55$, $D = 4$, $d_0 = 0.139$

Fig. 3. $\varepsilon = 0.05$, $u_0 = -0.55$, $D = 3$, $d_0 = 0.185$
Abstract. The paper presents a method for designing 2-D FIR linear-
phase filters using a closely connected parallel machine as a computation
platform. The 2-D FIR filter design according to the equiripple error
criterion in the passband and the least squares error criterion in the
stopband is considered. In the method, the filter design problem is
transformed into an equivalent bicriterion optimization problem. The original
problem is partitioned between processors using the regular domain
decomposition method. A design example demonstrating the performance
of the proposed approach is presented. The possibility of obtaining an equiripple
solution both in the passband and in the stopband is also discussed.



1 Introduction 

In recent years, with the rapid improvement in computer technology, two dimen- 
sional (2-D) digital signal processing has become more important. Therefore, the 
design problem of 2-D digital filters has been receiving a great deal of attention. 

Digital filters can be classified into two groups, i.e., finite impulse response
(FIR) filters and infinite impulse response (IIR) filters. Since FIR digital filters
are inherently stable and can have linear phase, they are often preferred
over IIR filters. 2-D FIR filters have many important applications, e.g., in radar
and sonar signal processing and in image processing.

Techniques for designing 2-D FIR digital filters have been developed exten- 
sively for several years [1-4] . The results of most of these techniques are given in 
the form of the impulse response of a 2-D filter, so the designed filter is suitable 
for a direct convolution realization. In case of the methods based on approxi- 
mating some specified frequency response, the least-square (LS) or the minimax 
error criteria are usually used. By using the LS error criterion, one gets an over- 
shoot of the frequency response at the passband and the stopband edges caused 
by Gibbs’ phenomenon [2]. The minimax design yields an equiripple solution and 
avoids the overshoot problem [1], but the solution is not necessarily unique and 
may fail to converge [3,1,5]. Besides, the existing design techniques are com- 
plicated and require long computational time [1,2]. In many cases, the design 
based on the equiripple error criterion in the passband and the LS criterion in 
the stopband is much more appropriate than the pure minimax or LS design [2] . 
In this paper, a new approach for the design of 2-D linear-phase FIR filters is
proposed. This approach is based on the equiripple error criterion in the passband
and the LS criterion in the stopband. In the proposed approach, the design
problem is transformed into an equivalent bicriterion optimization problem. The
parallelization of the method for the considered problem is also discussed.

2 Formulation of the Design Problem 

A 2-D FIR filter has a complex frequency response of the general form given 
by [3] 

M N 
m—0 n—0 

where $h(m,n)$ is the rectangularly sampled impulse response of the filter, $M$
and $N$ represent the lengths of the filter, and $\omega_1$ and $\omega_2$ are the horizontal and
vertical frequencies, respectively.

2-D FIR filters can be designed to have purely real frequency responses. Since
the imaginary component of the frequency response is equal to zero, the phase
response of such a filter will also be equal to zero and therefore these filters are
called zero-phase filters. The frequency response $H(\omega_1,\omega_2)$ of a zero-phase filter,
also called a zero-phase frequency response, is a real function so that

$H(\omega_1,\omega_2) = H^*(\omega_1,\omega_2)$ . (2)

Strictly speaking, not all filters with real frequency responses are zero-phase
filters, because $H(\omega_1,\omega_2)$ can be negative. In practice, $H(\omega_1,\omega_2)$ is negative
only in the stopband of the filter and this phase shift in the stopband has little
significance.

From the symmetry properties of the Fourier transform, equation (2) is
equivalent in the spatial domain to:

$h(m,n) = h^*(-m,-n)$. (3)

For real $h(m,n)$, equation (3) reduces to

$h(m,n) = h(-m,-n)$. (4)

From equation (4) it can be seen that the impulse response of a zero-phase filter
is symmetric with respect to the origin.

As $h(m,n)$ has twofold symmetry, only approximately half of the points in
it are independent. The region of support of $h(m,n)$ consists of three mutually
exclusive regions: the origin (0,0), $R^+$ and $R^-$. The regions $R^+$ and $R^-$ are
flipped with respect to the origin. As a result, the zero-phase frequency response
can be expressed as [3]

$H(\omega_1,\omega_2) = h(0,0) + \sum_{(m,n) \in R^+} 2h(m,n) \cos(m\omega_1 + n\omega_2)$. (5)
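
Equation (5) can be checked numerically against the direct Fourier sum. In the sketch below the impulse response is stored as a dictionary keyed by $(m,n)$, and $R^+$ is taken as the half-plane $\{n > 0\} \cup \{n = 0,\ m > 0\}$; this choice of $R^+$, the sample coefficients and all function names are assumptions made for illustration.

```python
import cmath
import math


def symmetric_filter(half):
    """Extend coefficients given on the origin and R+ by h(-m, -n) = h(m, n)."""
    h = dict(half)
    for (m, n), value in half.items():
        h[(-m, -n)] = value
    return h


def direct_response(h, w1, w2):
    """Frequency response as the direct sum of h(m, n) * exp(-j(m*w1 + n*w2))."""
    return sum(v * cmath.exp(-1j * (m * w1 + n * w2)) for (m, n), v in h.items())


def zero_phase_response(h, w1, w2):
    """Cosine form (5): h(0, 0) plus twice the contributions from R+."""
    total = h.get((0, 0), 0.0)
    for (m, n), v in h.items():
        if n > 0 or (n == 0 and m > 0):   # (m, n) lies in R+
            total += 2.0 * v * math.cos(m * w1 + n * w2)
    return total


if __name__ == "__main__":
    half = {(0, 0): 0.4, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.05, (-1, 1): -0.03}
    h = symmetric_filter(half)
    print(direct_response(h, 0.3, 1.1), zero_phase_response(h, 0.3, 1.1))
```

For a symmetric impulse response the imaginary part of the direct sum cancels pairwise, so both functions return the same real value up to rounding.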



414 Felicja Wysocka-Schillak 



In case of a linear-phase filter, the frequency response is given by 



where the complex exponential part denotes the linear-phase characteristic of the
filter and the zero-phase frequency response $H(\omega_1,\omega_2)$ represents the magnitude
response $A$ of the filter ($A = |H(\omega_1,\omega_2)|$).

Let the considered filter be a quadrantally symmetric filter. The zero-phase
frequency response of this filter can be written in the form [1]

$H(\omega_1,\omega_2) = \sum_{m=0}^{L_1} \sum_{n=0}^{L_2} a(m,n) \cos(m\omega_1) \cos(n\omega_2)$, (7)

where $a(m,n)$ are free filter coefficients which can be expressed in terms of
the impulse response $h(m,n)$ [1], and $L_1$ and $L_2$ are the numbers of free filter
coefficients ($M = 2L_1 + 1$, $N = 2L_2 + 1$).

The quadrantally symmetric FIR filter design problem is to find the filter
coefficients $a(m,n)$ such that the zero-phase frequency response of the filter is
the best approximation of the desired zero-phase frequency response $H_d(\omega_1,\omega_2)$
in the given sense.

Let $Y = [y_1, y_2, \ldots, y_{(L_1+1)(L_2+1)}]^T$ be a vector of filter coefficients such that:

$y_i = a(m,n)$, $i = 1,2,\ldots,(L_1+1)(L_2+1)$; $m = 0,1,\ldots,L_1$; $n = 0,1,\ldots,L_2$. (8)

Assume that the continuous $(\omega_1,\omega_2)$-plane is discretized by using a $K_1 \times K_2$
rectangular grid $(\omega_{1k},\omega_{2l})$, $k = 0,1,\ldots,K_1-1$, $l = 0,1,\ldots,K_2-1$. The desired
zero-phase frequency response $H_d(\omega_{1k},\omega_{2l})$ of the 2-D filter is:

$H_d(\omega_{1k},\omega_{2l}) = \begin{cases} 1 & \text{for } (\omega_{1k},\omega_{2l}) \text{ in the passband } P, \\ 0 & \text{for } (\omega_{1k},\omega_{2l}) \text{ in the stopband } S. \end{cases}$ (9)


Let $H(\omega_1,\omega_2,Y)$ denote the zero-phase frequency response of the filter
obtained by putting into equation (7) the coefficients given by vector $Y$. The error
function $E(\omega_{1k},\omega_{2l},Y)$ in the passband $P$ is:

$E(\omega_{1k},\omega_{2l},Y) = H(\omega_{1k},\omega_{2l},Y) - H_d(\omega_{1k},\omega_{2l})$, $(\omega_{1k},\omega_{2l}) \in P$. (10)

In the stopband $S$, the LS error $E_2(Y)$ to be minimized is:

$E_2(Y) = \sum_{(\omega_{1k},\omega_{2l}) \in S} \left[ H(\omega_{1k},\omega_{2l},Y) - H_d(\omega_{1k},\omega_{2l}) \right]^2$. (11)

The quadrantally symmetric filter design problem can be formulated as
follows: For a given zero-phase frequency response $H_d(\omega_{1k},\omega_{2l})$ defined on a
rectangular grid $K_1 \times K_2$, and prescribed values $L_1$ and $L_2$, find a vector $Y$ for which
the error function $E(\omega_{1k},\omega_{2l},Y)$ is equiripple in the passband and the LS error
$E_2(Y)$ is minimized in the stopband.
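
The quantities in (7), (10) and (11) can be evaluated on a grid as in the following sketch. Several details are assumptions made for illustration: the grid points $\omega_k = \pi k/(K-1)$, a circularly symmetric lowpass specification with passband radius `wp` and stopband radius `ws`, and all function names.

```python
import math


def zero_phase_response(a, w1, w2):
    """Equation (7): sum of a[m][n] * cos(m*w1) * cos(n*w2)."""
    return sum(a[m][n] * math.cos(m * w1) * math.cos(n * w2)
               for m in range(len(a)) for n in range(len(a[0])))


def frequency_grid(count):
    """Uniform grid on [0, pi] (assumed discretization of one frequency axis)."""
    return [math.pi * k / (count - 1) for k in range(count)]


def design_errors(a, k1, k2, wp, ws):
    """Return the passband error E on the grid, as in (10), and the LS error E_2 of (11).

    Points with radius <= wp form the passband (desired response 1),
    points with radius >= ws form the stopband (desired response 0);
    the transition band in between is ignored.
    """
    passband_error = {}
    ls_error = 0.0
    for k, w1 in enumerate(frequency_grid(k1)):
        for l, w2 in enumerate(frequency_grid(k2)):
            radius = math.hypot(w1, w2)
            response = zero_phase_response(a, w1, w2)
            if radius <= wp:
                passband_error[(k, l)] = response - 1.0
            elif radius >= ws:
                ls_error += (response - 0.0) ** 2
    return passband_error, ls_error


if __name__ == "__main__":
    coefficients = [[0.25, 0.25], [0.25, 0.25]]   # toy filter with L1 = L2 = 1
    error, e2 = design_errors(coefficients, 9, 9, 0.4 * math.pi, 0.6 * math.pi)
    print(max(abs(v) for v in error.values()), e2)
```

The passband error is kept pointwise because its local extrema are needed for the equiripple condition, whereas the stopband contributes only through the single scalar $E_2(Y)$.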
3 Reformulation of the Problem

The considered design problem can be transformed into an equivalent bicriterion
optimization problem. In order to do this, two objective functions $X_1(Y)$ and
$X_2(Y)$ should be introduced. The function $X_1(Y)$ is assumed to have its
minimum equal to zero when the error function $E(\omega_{1k},\omega_{2l},Y)$ is equiripple in
the passband. The error function $E(\omega_{1k},\omega_{2l},Y)$ is equiripple in the passband
when the absolute values $\Delta E_i(Y)$, $i = 1,2,\ldots,J$, of all the local extrema
of the function $E(\omega_{1k},\omega_{2l},Y)$ in the passband, as well as the maximum value
$\Delta E_{J+1}(Y)$ of $E(\omega_{1k},\omega_{2l},Y)$ at the passband edge, are equal, i.e.:

$\Delta E_1(Y) = \Delta E_2(Y) = \cdots = \Delta E_{J+1}(Y)$. (12)

In order to find a vector $Y$ for which the conditions (12) hold, we introduce an
objective function $X_1(\Delta E_1, \Delta E_2, \ldots, \Delta E_{J+1})$ defined as follows:

$X_1(\Delta E_1, \Delta E_2, \ldots, \Delta E_{J+1}) = \sum_{i=1}^{J+1} (\Delta E_i - R)^2$, (13)

where:

$R = \frac{1}{J+1} \sum_{j=1}^{J+1} \Delta E_j$ (14)

is the arithmetic mean of all $\Delta E_j$, $j = 1,2,\ldots,J+1$.

The function $X_1$ defined above is a non-negative function of $\Delta E_1, \Delta E_2, \ldots,
\Delta E_{J+1}$ and it is equal to zero if and only if $\Delta E_1 = \Delta E_2 = \ldots = \Delta E_{J+1}$. As
$\Delta E_1, \Delta E_2, \ldots, \Delta E_{J+1}$ are functions of the vector $Y$, the function $X_1$ can be
used as the first objective function in our bicriterion optimization problem. As
the second objective function we use the LS error, so $X_2(Y) = E_2(Y)$.

Using the weighted sum strategy, which converts the bicriterion optimization
problem into a single criterion one, the equivalent optimization problem can be
stated as follows: For given filter specifications and a weighting coefficient $\beta$ find
a vector $Y$ such that the function

$X(Y,\beta) = \beta X_1(Y) + X_2(Y)$ (15)

is minimized.

The optimization problem formulated above can be solved using standard
minimization methods.
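
The objective functions (13)-(15) translate directly into code. The sketch below takes the already computed extremal values $\Delta E_i$ and the LS error as inputs; how they are obtained from $Y$ is left out, and the function names are introduced here for illustration.

```python
def equiripple_objective(extrema):
    """X_1 of (13): squared deviations of the extremal values from their mean R of (14)."""
    mean = sum(extrema) / len(extrema)
    return sum((value - mean) ** 2 for value in extrema)


def weighted_objective(extrema, ls_error, beta):
    """X of (15): beta * X_1 + X_2, with X_2 equal to the stopband LS error."""
    return beta * equiripple_objective(extrema) + ls_error


if __name__ == "__main__":
    print(equiripple_objective([0.5, 0.5, 0.5]))        # all ripples equal
    print(weighted_objective([1.0, 3.0], 0.5, 10.0))    # unequal ripples are penalized
```

A larger weighting coefficient $\beta$ puts more emphasis on equalizing the passband ripples at the expense of the stopband LS error.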

Note that using the proposed approach it is also possible to obtain a filter
with an equiripple passband and stopband. In this case, the extrema of the
error function $E(\omega_{1k},\omega_{2l},Y)$ should be calculated both in the passband and in
the stopband as well as at the passband and stopband edges, and the objective
function $X_1(Y)$ should be suitably modified. Then the problem of finding a
vector $Y$ such that the function $X_1(Y)$ is minimized should be solved.
4 Parallelization of the Problem and Design Results

There are different methods of decomposing a problem to run on a parallel
machine. In our problem, we obtain two $K_1 \times K_2$ matrices, i.e., the matrix $H_d$ whose
elements represent the desired zero-phase frequency response and
the matrix $H$ whose elements represent the actual zero-phase frequency response
$H(\omega_{1k}, \omega_{2l}, \mathbf{Y})$ of the filter calculated using equation (7). The matrices $H_d$
and $H$ can be treated as regular grids of data elements, so the regular
domain decomposition method can be applied. In this method, the regular grid
structure is split into regular subgrids and these subgrids are distributed to
separate processes where they can be operated on. As a result, the calculations
can be performed in parallel and the problem can be solved in a shorter time.

In our problem, the decomposition strategy is to divide the two matrices
$H_d$ and $H$ into equally sized, equally shaped parts and to assign the corresponding parts
of these matrices to different processes. The number of these parts depends on
the number of available processors.

At each iteration, processor $j$, where $j = 0, 1, \ldots, NP - 1$ and $NP$
is the number of available processors, calculates $\Delta E_i(\mathbf{Y})$, $i = 1, 2, \ldots, k$,
inside its data block. Note that in order to check whether a point is an extremum,
the values of the function at the neighbouring points must be known. So the
considered data blocks must have overlap areas with a thickness of one element
in the vertical dimension. When all $\Delta E_i(\mathbf{Y})$, $i = 1, 2, \ldots, J+1$, are determined, they
are accumulated on processor 0 in order to calculate their arithmetic mean $R$.
Then the sum $X_{1j}(\mathbf{Y})$ defined as follows:

$$X_{1j}(\mathbf{Y}) = \sum_{i=1}^{k} (\Delta E_i - R)^2, \qquad (16)$$

where $k$ is the number of $\Delta E_i(\mathbf{Y})$ determined by process $j$, is calculated by
every processor $j$, $j = 0, 1, \ldots, NP - 1$. Every processor $j$ also calculates the
part $E_{2j}(\mathbf{Y})$ of the total stopband error $E_2(\mathbf{Y})$. All $X_{1j}(\mathbf{Y})$ and $E_{2j}(\mathbf{Y})$ are
gathered by processor 0, where the objective function $X(\mathbf{Y})$ is calculated as the
following sum:

$$X(\mathbf{Y}) = \beta \sum_{j=0}^{NP-1} X_{1j}(\mathbf{Y}) + \sum_{j=0}^{NP-1} E_{2j}(\mathbf{Y}). \qquad (17)$$

The determined value of $X(\mathbf{Y})$ is used in the minimization problem.
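
The accumulation described by (16) and (17) can be sketched serially in Python. This is only an illustration under stated assumptions: the processes are simulated by list blocks rather than MPI ranks, the helper names are hypothetical, and the mean R is assumed to be made available to every process after it has been formed on processor 0.

```python
def split_blocks(values, nproc):
    """Split a list into nproc contiguous blocks of nearly equal size."""
    base, extra = divmod(len(values), nproc)
    blocks, start = [], 0
    for j in range(nproc):
        size = base + (1 if j < extra else 0)
        blocks.append(values[start:start + size])
        start += size
    return blocks


def parallel_objective(delta_e_blocks, e2_parts, beta):
    """Combine per-process partial sums into the objective of eq. (17)."""
    # "Processor 0" collects all extrema and forms the global mean R.
    all_delta = [d for block in delta_e_blocks for d in block]
    r = sum(all_delta) / len(all_delta)
    # Each process j forms its partial sum X_1j from its own extrema, eq. (16).
    x1_parts = [sum((d - r) ** 2 for d in block) for block in delta_e_blocks]
    # "Processor 0" adds the gathered partial results, eq. (17).
    return beta * sum(x1_parts) + sum(e2_parts)
```

Because the partial sums are simply added, the result does not depend on how the extrema are distributed among the blocks.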

A computer program based on the described design procedure was written in
Fortran using the MPI library and executed on a closely connected parallel machine
(Sun Fire 6800). In our program, the unconstrained minimization problem is
solved using the Polak-Ribiere version of the conjugate gradient algorithm [4].

To illustrate the performance of the proposed technique, we consider the 
design of a quadrantally symmetric diamond shaped filter for M = N = 15 and 
/3 = 2 X 10^. The passband of the filter is situated between the points (0, O.Stt), 
(O.Stt, 0), (0, —O.Stt) and (—O.Stt, 0) on the (wifc, W 2 ;)-plane. The width of the 
Fig. 1. Magnitude response of the filter designed in the example ($x = \omega_{1k}/\pi$, $y = \omega_{2l}/\pi$)




Fig. 2. The achieved speed-up (solid line) 



transition band is O.IBtt. A square grid of 101x101 points is used. The resulting 
magnitude response A of the filter is shown in Fig. 1. 

To compare the execution times, the program was run using various numbers
of processors. The achieved speed-up is shown in Fig. 2. Note that the best
speed-up is achieved for a small number of processors.
5 Conclusions

A technique for the design of 2-D linear-phase FIR filters with quadrantally 
symmetric magnitude response has been proposed. The technique is simple to 
implement and standard optimization procedures can be used to solve the con- 
sidered minimization problem. The proposed technique is also relatively easy 
to parallelize because the original problem can be partitioned into independent 
parts that can be distributed to different processors for solution. The presented 
approach can also be applied to the design of one-dimensional FIR filters [6] . 
Abstract. The parallel properties of three fast direct solution methods
for linear systems with separable block tridiagonal matrices and a related
C/MPI code are studied. A fast algorithm for separation of variables and
two new variants of the generalized marching algorithm are first summarized.
The results from numerical tests performed on two coarse-grained
parallel architectures are then reported. The obtained speed-up and efficiency
coefficients are compared. The presented results confirm that the best
sequential solver does not always have the best parallel performance.



1 Introduction 

A measure for the efficiency of a given sequential direct solver is its computational
complexity. The speed-up and efficiency coefficients play a key role in the
analysis of parallel algorithms. They are based on the times per processor for
computations $T_{comp} = \mathcal{N} \cdot t_a$ and communications $T_{com} = c_1 \cdot t_s + c_2 \cdot t_w$. Here,
$\mathcal{N}$ is the number of arithmetic operations, $c_1$ characterizes the number of stages
at which communications are needed and $c_2$ is the amount of transferred
words. Both $c_1$ and $c_2$ can be constants or functions of the number of processors
and/or the size of the problem. The parameters $t_a$, $t_s$ and $t_w$ depend on the
parallel computer. The largest of them is $t_s$, and it can be hundreds or
even thousands of times larger than $t_w$. That is why one and the same parallel
solver may behave differently on machines with different characteristics.

Our goal was to compare the performance of three direct solvers on two
of the coarse-grained parallel architectures available at Texas A&M University
(TAMU). We summarize in this paper results obtained on an SGI Origin
2000 and a Beowulf cluster. Parallel implementations of two variants (denoted for
brevity GMF and GMS) of the generalized marching (GM) algorithm are proposed
and experimentally compared. The third considered solver is the fast algorithm for
separation of variables (FSV). The technique for incomplete solution of problems
with sparse right-hand sides (SRHS), proposed independently by Banegas,
Proskurowski, and Kuznetsov, is a principal step used at different stages in each
* Supported in part by the USA National Science Foundation under Grant DMS 
9973328, by the Bulgarian Ministry of Education and Science under Grant MU- 
of the studied solvers. More theoretical aspects of problems with sparsity
are investigated by Kuznetsov in [5]. The algorithm FSV is proposed in [9]. Its
parallelization aspects for the Poisson equation on a nonuniform mesh are analyzed
in [7]. Here we consider a slightly different variant proposed in [4]. The GM
algorithm was first developed in [1,2] and later reformulated in [8] using SRHS
and FSV (the latter variant is denoted in the present paper by GMF). The algorithm
GMS is a modification of GM based on SRHS instead of FSV. We introduce it
to improve the parallel properties of the GM algorithm.

In the remainder of the paper, we first briefly present the algorithms (Section 2)
and their parallel implementation (Section 3). In Section 4 we compare not only
the obtained speed-up and efficiency coefficients, but also the measured cpu-times.
The obtained results show that although GMS is the slowest sequential
algorithm, its parallel implementation PGMS has better properties and on some
machines it is asymptotically the best one among the considered solvers.

2 Fast Separable Solvers 

We start the exposition in this section with the formulation of the problem, followed
by a description of SRHS. Next we briefly review the FSV and the two variants
GMF and GMS of the GM algorithm. For more details see [6, 3].

Formulation of the problem. A Dirichlet boundary value problem (bvp) for
a separable second order elliptic equation with variable coefficients is considered.
It is discretized on a rectangular $n \times m$ grid by finite differences or by piecewise
linear finite elements on right-angled triangles. Using the $n \times n$ identity matrix $I_n$,
the tridiagonal, symmetric and positive definite matrices $T = (t_{i,j})_{i,j=1}^{n}$ and
$B = (b_{i,j})_{i,j=1}^{m}$, and the Kronecker product
$C_{m \times m} \otimes D_{n \times n} = (c_{i,j} D)_{i,j=1}^{m}$, the obtained system is written in the form:

$$A\mathbf{x} = (B \otimes I_n + I_m \otimes T)\mathbf{x} = \mathbf{f}, \qquad (1)$$

where $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m)^T$, $\mathbf{f} = (\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_m)^T$, $\mathbf{x}_j, \mathbf{f}_j \in \mathbb{R}^n$, $j = 1, \ldots, m$.

Incomplete Solution Technique. It is assumed that the right-hand side
(RHS) $\mathbf{f}$ of (1) has only $d$ nonzero block components and only $r$ ($r, d \ll m$) block
entries of the solution are needed. Let for definiteness $\mathbf{f}_j = 0$ for $j \ne j_1, j_2, \ldots, j_d$.
To find $\mathbf{x}_{j'_1}, \mathbf{x}_{j'_2}, \ldots, \mathbf{x}_{j'_r}$, the well-known algorithm for discrete separation of
variables is applied, taking advantage of the RHS sparsity:

Algorithm SRHS

Step 0. determine all the eigenvalues $\{\lambda_k\}_{k=1}^{m}$ and the needed (at most $r + d$) entries
$\{q_{k,j}\}$, $j = j_1, \ldots, j_d$ and $j = j'_1, \ldots, j'_r$, of all the eigenvectors $\mathbf{q}_k$ of
the tridiagonal matrix $B$;

Step 1. compute the Fourier coefficients $\beta_{i,k}$ of $\mathbf{f}_i$ from the equations:

$$\beta_{i,k} = \mathbf{q}_k^T \cdot \mathbf{f}_i = \sum_{s=1}^{d} q_{k,j_s} f_{i,j_s}, \quad i = 1, \ldots, n, \ k = 1, \ldots, m;$$

Step 2. solve $m$ independent $n \times n$ tridiagonal systems of linear equations:

$$(\lambda_k I_n + T)\,\boldsymbol{\eta}_k = \boldsymbol{\beta}_k, \quad \boldsymbol{\beta}_k = (\beta_{1,k}, \ldots, \beta_{n,k})^T, \quad k = 1, \ldots, m;$$

Step 3. recover $r$ components of the solution per lines using

$$\mathbf{x}_j = \sum_{k=1}^{m} q_{k,j}\, \boldsymbol{\eta}_k, \quad j = j'_1, j'_2, \ldots, j'_r.$$

End {SRHS}.
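
The steps of Algorithm SRHS can be sketched in Python for a model case. The sketch is illustrative only and is not the C/MPI code studied in the paper: it assumes, for the sake of a closed-form eigendecomposition, that both B and T equal tridiag(-1, 2, -1), and the function names are hypothetical.

```python
import math


def thomas(diag, off, rhs):
    """Solve a tridiagonal system with main diagonal `diag` and constant off-diagonal `off`."""
    n = len(rhs)
    c = [0.0] * n                      # modified super-diagonal
    d = [0.0] * n                      # modified right-hand side
    c[0] = off / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def srhs_solve(n, m, rhs_blocks, wanted):
    """Incomplete solution of (B (x) I_n + I_m (x) T) x = f for B = T = tridiag(-1, 2, -1).

    rhs_blocks maps the 1-based indices of the nonzero blocks of f to lists of length n;
    wanted lists the 1-based indices of the solution blocks to recover.
    """
    scale = math.sqrt(2.0 / (m + 1))

    def q(k, j):
        # Step 0: entry j of the k-th eigenvector of B (known in closed form here).
        return scale * math.sin(j * k * math.pi / (m + 1))

    solution = {j: [0.0] * n for j in wanted}
    for k in range(1, m + 1):
        lam = 2.0 - 2.0 * math.cos(k * math.pi / (m + 1))   # k-th eigenvalue of B
        # Step 1: Fourier coefficients, summing over the nonzero RHS blocks only.
        beta = [sum(q(k, j) * f[i] for j, f in rhs_blocks.items()) for i in range(n)]
        # Step 2: tridiagonal system (lam * I_n + T) eta = beta.
        eta = thomas([lam + 2.0] * n, -1.0, beta)
        # Step 3: accumulate only the requested solution blocks.
        for j in wanted:
            qkj = q(k, j)
            for i in range(n):
                solution[j][i] += qkj * eta[i]
    return solution
```

For example, `srhs_solve(4, 7, {3: [1.0, 2.0, 3.0, 4.0]}, [2])` recovers only the second block of the solution; the cost of Steps 1 and 3 then scales with the number of nonzero and requested blocks rather than with m.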



Fast Algorithm for Separation of Variables consists of a forward (FR) and
a backward (BR) recurrence. Let for simplicity $m = 2^l - 1$, $l \in \mathbb{Z}$. At each step
$k$ of FR and BR, systems with specific sparse RHS are constructed and solved
incompletely using Algorithm SRHS. The compact form of FSV is:
Algorithm FSV 
FR: Set f(^) = f 



for k = 1 to I — 1 

for s = 1 to 2'-'^ solve 



incompletely, finding only 
end {loop on s} 
for s = 1 to — 1 compute 



{k,s) (k,s) (k,s) 

7 -^2^ — 1 



(,(fe+i) 

s2'= 



f {k) 7 z. \ 

~ 0s2'=,s 2'=-1^2''-1 ~ ®s2'=,s2'“ + l^l 



Ak,s) 



(fe.s+1) 



end {loop on s } 
end {loop on k} 

solve incompletely Ax*^*) = f*^^^ only for x^^.i = X 21-1 
for fc = / — 1 down to 1 

for s = 1 to 2^“^ solve incompletely 



only for 



Then set X( 2 ^_i) 2 fc-i = y^t-i + x^t-1 
end {loop on s} 
end {loop on k} 

End {FSV}. 

The matrices $A^{(k,s)}$ consist of $2^k - 1$ blocks of order $n$ and have the form
$A^{(k,s)} = B_s^{(k)} \otimes I_n + I_{2^k-1} \otimes T$, $s = 1, 2, \ldots, 2^{l-k}$. They are constructed using the
principal sub-matrices $B_s^{(k)} = \mathrm{tridiag}(b_{s_k+i,s_k+i-1},\, b_{s_k+i,s_k+i},\, b_{s_k+i,s_k+i+1})_{i=1}^{2^k-1}$
of $B$ for $s_k = (s - 1)2^k$. The RHS have one nonzero component for the FR and
two for the BR. It is important to note here that the update of the RHS
for each $k$ in FR and the consecutive recovering of the solution of (1) in BR
require the data determined at the previous stages.



Generalized Marching (GM) Algorithm is a stabilized version of the standard
marching (SM) algorithm. We assume that $m + 1 = p(k + 1)$ for sufficiently
large $p$ ($p, k \in \mathbb{Z}$). The original system (1) is reordered and rewritten in
two-by-two block form. Applying block Gaussian elimination, it is reduced
to the solution of two systems with $A_{1,1}$ and one system with the Schur complement.
Here $A_{1,1} = \mathrm{blockdiag}(A_s^{(k)})_{s=1}^{p}$, $A_s^{(k)} = I_k \otimes T + B_s^{(k)} \otimes I_n$, where
$B_s^{(k)} = \mathrm{tridiag}(b_{k_s+i,k_s+i-1},\, b_{k_s+i,k_s+i},\, b_{k_s+i,k_s+i+1})_{i=1}^{k}$, $k_s = (s - 1)(k + 1)$, $s = 1, \ldots, p$.
Hence, each of the systems with $A_{1,1}$ is equivalent to the solution of $p$ independent



424 Gergana Bencheva 



subproblems with matrices $A_s^{(k)}$, $s = 1, \ldots, p$. The system with the Schur complement
is equivalent to the incomplete solution of a system with the original matrix
$A$ and with a sparse RHS. The algorithm GM is summarized as follows:
Algorithm GM 

Step 1. for s = 1 to p 

solve using Algorithm SM 

end {loop on s} 
compute — ^ 2,1 

Step 2. solve incompletely Ax = f, ^ 0 for i = s{k + 1), (fi = ) 

seeking only 9.s(^k+i) s=l,...,p-l; 

Step 3. compute 

for s = 1 to p 

solve a1^^ using Algorithm SM 

end {loop on s} 

End {GM}. 

Step 2 of Algorithm GM, as proposed in [8], is handled by Algorithm FSV
(only $l_p = \log p$ steps are needed). We refer to this variant of GM
as Algorithm GMF. If we use Algorithm SRHS instead of FSV, we get
the more expensive Algorithm GMS. Its advantages are discussed below.

The Computational Complexity of the FSV, GMF and GMS algorithms is
respectively $\mathcal{N}_{FSV} \sim 24nm(\log m - 1) - 9nm$, $\mathcal{N}_{GMF} \sim 62mn + 24nm(\log p - 1) - 9nm$
and $\mathcal{N}_{GMS} \sim 62mn + 4pnm + nm$. These expressions are based on
the arithmetic cost $\mathcal{N}_{SRHS} \sim 2(r + d)nm + 5nm$ of SRHS. Algorithm GMF
is always slightly faster than Algorithm FSV. For some sizes of the discrete
problem ($4 < l < 8$, $m = 2^l - 1$) Algorithm GMS is faster than both FSV and
GMF, but asymptotically it is the slowest one.

3 Parallel Implementation 

How are the data and the computations distributed among the processors? What
are the type and the amount of communications induced by this division? How
good is the developed solver? These are the questions treated in this section for
each of our methods. The parallel implementations of the algorithms FSV, GMF
and GMS are referred to as PFSV, PGMF and PGMS respectively.

Initial Data in Each Processor. Let us have $NP = 2^{np}$ processors enumerated
from 0 to $NP - 1$. Let us also assume that $m + 1 = 2^l = p(k + 1)$, $p = 2^{l_p}$,
$k + 1 = 2^{l_k}$, $l = l_k + l_p$ ($l, l_p, l_k \in \mathbb{Z}$). For PFSV and both variants of
PGM, the computational domain is decomposed into a number of strips equal to
the number of processors. The initial data is generated in two different ways
depending on the method. Each processor contains the whole matrix $T$ and
the entries of the matrix $B$ corresponding to one or more successive strips (see
Fig. 1). For PFSV and PGMF, parts of the RHS also have to be duplicated, while
for PGMS each processor contains the components of $\mathbf{f}$ only for one strip. The
first advantage of PGMS is thus the smaller amount of memory required for the initial data.
Fig. 1. Initial data in each processor for NP = 8 (panels: PFSV, PGMF and PGMS; the strips are labelled LSTRIP and LSTRIP-1)



Parallel Implementation of FSV. Both FR and BR are divided into two
stages. The first one corresponds to the values of $k$ for which sequential SRHS is
used. The second one is for the case when the size of the subsystems solved
incompletely is larger than LSTRIP and a parallel implementation of SRHS
(PSRHS) should be used to improve the load balance of the computer system.
All the processors have to complete an equal amount of computations, since for a
given $k$ they each have to solve the same number of subsystems with SRHS at the first
stage and one subsystem with PSRHS at the second one. The communications
for the first stage of FR are local: one entry of the solution and one component
of the RHS have to be sent from the processor myid to myid-1. For the second
stage the same data have to be transferred, but now between processors which
are not neighbours. Additional global communications are required in PSRHS.
For the case $k = l - 1, \ldots, l - np + 1$ of BR, apart from the communications for PSRHS,
only one component of the solution has to be transferred, but to all processors
which will need it. The other stage is completed without any communications.

Parallel Implementation of SRHS is required in all of the considered solvers.
The input and the output data of PSRHS are handled in two slightly different
ways depending on the algorithm. Hence some additional communications have
to be performed in the version for PFSV and PGMF.

In the preprocessing Step 0 the master solves the eigenproblem and distributes
the data in the required way. The computations and communications
at Step 1 are the same as for a matrix-matrix multiplication.
The data partitioning for this multiplication is: the left matrix is divided into strips by
columns, the right one by rows, and the product should again be distributed among
the processors in strips by columns. Global collective communications (scatter and
reduce = gather + broadcast) are required at this stage. Computations are divided
into equal parts. For the next Step 2 each processor has to solve $m/NP$
independent tridiagonal systems without any communications. The third step
is performed in a similar way to Step 1, with the only difference that the second
matrix and the result are now column vectors.
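
The data partitioning used at Step 1 can be illustrated by a small serial Python sketch. It only mimics the communication pattern: the loop over `p` stands in for the processes, the summation of partial products stands in for the reduce operation, the function names are hypothetical, and the final redistribution of the product in column strips is omitted.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def strip_product(left, right, nproc):
    """Product left * right with left split by columns and right split by rows."""
    inner = len(right)
    assert inner % nproc == 0, "inner dimension must be divisible by nproc"
    width = inner // nproc
    total = [[0.0] * len(right[0]) for _ in left]
    for p in range(nproc):
        lo, hi = p * width, (p + 1) * width
        left_strip = [row[lo:hi] for row in left]    # column strip owned by process p
        right_strip = right[lo:hi]                   # matching row strip of process p
        partial = matmul(left_strip, right_strip)    # local work, no communication
        for i, row in enumerate(partial):            # "reduce": add the partial products
            for j, value in enumerate(row):
                total[i][j] += value
    return total
```

Each simulated process multiplies a column strip by the matching row strip, so every partial product has the full size of the result; this is why a global reduction is needed before the result can be used.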

Parallel Implementation of GM. The first and the last steps of both algorithms
PGMF and PGMS are the solution of $p/NP$ systems with Algorithm SM,
i.e. all the processors perform an equal amount of work. To compute the related
RHS, one vector of size $n$ has to be transferred to one of the neighbours. Step
2 is handled either by PFSV or by PSRHS, which were discussed above. Note
that the number of stages in PGMS at which communications are needed is a
constant, while in PFSV and PGMF it depends on both the size of the problem and
the number of processors used.

Table 1. Results for the Sequential Algorithms

                                 GMF                 GMS
Platform      n      FSV     k = 3    k = 7     k = 3    k = 7
Grendel     255     3.12      3.02     2.90      1.99     1.02
Grendel     511    19.01     19.29    11.66     18.64     7.94
Grendel    1023    87.62     87.49    73.61    117.79    86.03
Beowulf     255     6.91      6.95     4.02      4.48     2.53
Beowulf     511    28.57     28.42    16.88     23.04    12.79
Beowulf    1023   121.33    120.67    74.45    133.99    72.65

4 Numerical Tests 

We implemented the presented parallel solvers in C using the MPI standard. The
achieved performance on two coarse-grained parallel platforms, referred to as
Grendel and Beowulf, is analyzed. Grendel is a Silicon Graphics Origin 2000
with 8 R10000 processors at 250MHz, 4MB L2 cache and 4GB of RAM. Beowulf
is a cluster of 16 Digital (Compaq) Personal Workstations 500au with a single
EV56 processor at 500MHz and 128MB RAM per node, connected via 100Mbps
switched Ethernet. To illustrate the properties of the considered algorithms and
the related code, we present here results obtained for the following
Example: The coefficients of the initial bvp (see Section 2) are ai(a;i) = 1-1- 
Xi, a 2 (x 2 ) = . As a solution is taken u(xi, X 2 ) = (1 — Xi)xiX 2 (l — X 2 ) and 

the RHS corresponds to the above data. 

The related discrete problem has $N = n^2$ degrees of freedom. It is obtained by
a five-point difference scheme applied on a uniform $n \times n$ mesh with mesh parameter
$h = 1/(n + 1)$, i.e. $m = n$, $n + 1 = p(k + 1)$, $p, k \in \mathbb{Z}$. We report below the best
times in seconds obtained from 3 executions of the code on a given machine. We
compare in Table 1 the behaviour of the sequential algorithms on both platforms.
The first column gives the machine, and the second one the value of $n$. The third
column presents the cpu-time for the FSV algorithm. The next four columns are
grouped in pairs and give the times for GMF and GMS for two different values of
$k$. For larger $k$ the standard marching algorithm for this example is not stable.
How the stability of SM affects the error for GMF is shown in [3]. We see that
the times on Grendel are in general smaller than those on Beowulf, but the
behaviour is similar. Times for $k = 7$ for both GMF and GMS are better than
those for $k = 3$. For small problems the algorithm GMS is the fastest one.
But, as expected, for $n = 1023$ it is the slowest one.

The measured times for the related parallel algorithms and the achieved
speed-up and efficiency are reported in Table 2. Again the first column presents
the platform and the second one characterizes the size of the problem: $n$ is
given, where the number of unknowns is $N = n^2$ and $k = 7$. The third column
shows the number $NP$ of processors used. The rest of the columns are in
groups of three, one for each algorithm. In each group the first column stands for
the measured cpu-time $T_{NP}$, the second is for the speed-up $S_{NP} = T_2/T_{NP}$ and the third
one is for the efficiency $E_{NP} = 2S_{NP}/NP$. The theoretical upper bounds for the
speed-up and the efficiency for these cases are $S_{NP} \le NP/2$ and $E_{NP} \le 1$.

Table 2. Results for the Parallel Algorithms, k = 7

                             PFSV                    PGMF                    PGMS
Platform     n   NP      T_NP   S_NP   E_NP      T_NP   S_NP   E_NP      T_NP   S_NP   E_NP
Grendel    255    2      2.47      -      -      1.60      -      -      1.07      -      -
Grendel    255    4      1.42   1.74   0.87      0.89   1.80   0.90      0.63   1.70   0.85
Grendel    255    8      1.08   2.29   0.57      0.65   2.46   0.62      0.37   2.89   0.72
Grendel    511    2     16.86      -      -     10.52      -      -      7.59      -      -
Grendel    511    4      8.85   1.91   0.96      5.73   1.84   0.92      3.95   1.92   0.96
Grendel    511    8      4.98   3.39   0.85      3.28   3.21   0.80      2.24   3.39   0.85
Grendel   1023    2     79.59      -      -     49.53      -      -     50.96      -      -
Grendel   1023    4     41.63   1.91   0.96     26.13   1.90   0.95     24.00   2.12   1.06
Grendel   1023    8     29.38   2.71   0.68     16.19   3.06   0.77     12.49   4.08   1.02
Beowulf    255    2      8.94      -      -      6.31      -      -      2.29      -      -
Beowulf    255    4      3.90   2.29   1.15      2.72   2.32   1.16      1.74   1.32   0.66
Beowulf    255    8      2.31   3.87   0.97      1.55   4.07   1.02      1.11   2.06   0.52
Beowulf    511    2     47.52      -      -     36.02      -      -     13.28      -      -
Beowulf    511    4     23.40   2.03   1.02     17.83   2.02   1.01      8.51   1.56   0.78
Beowulf    511    8     15.44   3.07   0.77     13.35   2.70   0.67      8.14   1.63   0.41
Beowulf   1023    2    271.13      -      -    227.04      -      -     89.46      -      -
Beowulf   1023    4    165.57   1.64   0.82    148.74   1.53   0.77     50.45   1.77   0.89
Beowulf   1023    8    126.88   2.14   0.53    125.18   1.81   0.45     41.07   2.18   0.55
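
The tabulated coefficients can be recomputed from the measured times with a short Python sketch. It rests on an assumption inferred from the numbers in Table 2 rather than stated explicitly in the surviving text: the speed-up is taken relative to the two-processor run, and the efficiency is the speed-up divided by NP/2. The function name is hypothetical.

```python
def speedup_efficiency(times, base_np=2):
    """Return {NP: (speed-up, efficiency)} relative to the run on base_np processors.

    times maps the number of processors to the measured cpu-time in seconds.
    """
    t_base = times[base_np]
    result = {}
    for nproc, t in times.items():
        if nproc == base_np:
            continue                      # the reference run has no coefficients
        s = t_base / t                    # speed-up relative to the base run
        e = s * base_np / nproc           # efficiency, bounded by 1 in theory
        result[nproc] = (round(s, 2), round(e, 2))
    return result
```

Applied to the PGMS times on Grendel for n = 255 (1.07, 0.63 and 0.37 seconds on 2, 4 and 8 processors), this reproduces the entries 1.70 and 0.85 for NP = 4 and 2.89 and 0.72 for NP = 8.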

Let us first see what happens on Grendel. The tendency for each of the
algorithms is that $S_{NP}$ and $E_{NP}$ increase with the problem size, and $E_4 > E_8$.
There is an additional penalty in the speed-up (and the efficiency) for $NP = 8$
in the case $n = 1023$ because Grendel has only 8 processors. The algorithm with
the best speed-up for $NP = 4$ changes when the value of $n$ is varied. For the case
$NP = 8$ the "winner" is always PGMS. The behaviour of the speed-up and the
efficiency on Beowulf is completely different. For algorithms PFSV and PGMF
the efficiency decreases for larger values of $n$, while for PGMS it increases. These
results demonstrate how the type and the amount of communications can
change the real parallel performance of a given solver on different platforms. The
"winner" again changes, but now PGMS is the loser for $n = 255, 511$ and
$NP = 4, 8$. For large problems, i.e. $n = 1023$, PGMS has the best efficiency
for both $NP = 4, 8$, although it is not close to the theoretical upper bound. The
additional influence of factors like the memory access architecture explains
the superlinear speed-ups. The cpu-times for PGMS on both machines are the
smallest for nearly all values of $n$ and $NP$ (the only exception in Table 2 is
Grendel with $n = 1023$ and $NP = 2$).






5 Concluding Remarks 

The parallel performance of three direct separable solvers was compared. The 
behaviour of the speed-up and efficiency of PFSV and PGMF was completely 
different on the different platforms we considered. For PGMS it was one and the 
same, although the values on Beowulf were smaller. The prediction that on some 
machines the slowest sequential solver will have the best parallel performance 
was confirmed. 

Plans for future work on this topic include: a) an MPI/OpenMP modification
of the code (PFSV and both PGM variants); b) more experiments on (coarse- and
fine-grained) shared memory, distributed memory and heterogeneous systems;
c) generalization of the considered solvers to the 3D case; and d) development of efficient
sequential and parallel preconditioners on the basis of FSV and both GM variants.
Abstract. In this paper we develop monotone finite volume difference
schemes for a two-dimensional singularly perturbed convection-diffusion
elliptic problem with an interface. Theoretical results and numerical experiments
for fitted mesh (Shishkin's mesh) approximations are presented.



1 Introduction 



We seek a solution u G C{f2) n C^{f2 U f?'*') of the problem 
Lu = —eAu + div{au) + bu = f, (x, y) G fl~ U I?'*', = g, 



M r = '“(^ + 0) y) - - 0, y) = 0, - 



du 



Hru(^,y) = Q(y), on A, 



J r 



( 1 ) 

( 2 ) 



where 0 < e « 1, I?" = (0,^ x (0,1), 17+ = (^,1) x (0,1), 17 = f?" U 17+, 
r = {x = G {Q, 1)}, a = {ai{x,y),a 2 {x,y)) , |a| > ( 01 , 02 ) > (0,0) and 



b{x,y) — -div a{x,y) > /? > 0 for (a;, y) G 17 U 17+. 



(3) 



The functions $a$, $b$ and $f$ could be discontinuous on the segment $\Gamma$.

This problem may be viewed as modeling a steady-state convection-diffusion
process. The importance of accurate numerical methods lies in the fact that it
can also be regarded as a simple linear model of the Navier-Stokes flow equations.
Boundary and interior layers are normally present in the solutions of problems
involving such equations. These layers are thin regions in the domain where the
gradient of the solution steepens as the singular perturbation parameter $\varepsilon$ tends
to zero.

While there is a vast literature dealing with convection-diffusion problems
of type (1) with smooth coefficients and a smooth right-hand side, see e.g. [8,
9], we are aware of only a handful of papers where discontinuous input data are
discussed [3,5,7].

Note that the sign pattern of the convection coefficients is essential for the
solution behaviour. The reduced problem is a first-order discontinuous hyperbolic
problem. As a result, boundary and interior layers appear near the outer boundary
of $\Omega^-$ and $\Omega^+$. We concentrate on the case when



$$\xi = 0.5, \quad a_1(x,y) > 0 \ \text{in } \Omega, \quad a_2(x,y) > 0 \ \text{in } \Omega^-, \quad a_2(x,y) < 0 \ \text{in } \Omega^+. \qquad (4)$$



I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 429-437, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
In view of (3) the problem (1)-(4) satisfies the maximum principle. We assume
also that the solution of the problem (1)-(4) satisfies the following estimates.



Proposition 1 Let the eoefficients of problem (l)-(4-) are suffieiently smooth 
and satisfy all necessary compatibility conditions, that guarantee u S C(l7) H 
C^{L2~ U Then the solution u can be decomposed into regular part V that 

satisfies for all k, l-integer, k,l = 0, . . . k + I < 5 the estimates 

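$$|D_x^k D_y^l V(x,y)| \le C, \quad (x,y) \in \Omega^- \cup \Omega^+, \qquad (5)$$

and singular part $W$

$$W(x,y) = \begin{cases} E_x^-(x,y) + E_y^-(x,y) + E_{xy}^-(x,y), & (x,y) \in \Omega^-, \\ E_x^+(x,y) + E_y^+(x,y) + E_{xy}^+(x,y), & (x,y) \in \Omega^+, \end{cases} \qquad (6)$$

that satisfies

$$\begin{aligned}
|D_x^k D_y^l E_x^-(x,y)| &\le C\varepsilon^{-k} \exp\left(-m_1(0.5 - x)/\varepsilon\right), \\
|D_x^k D_y^l E_y^-(x,y)| &\le C\varepsilon^{-l} \exp\left(-m_2(1 - y)/\varepsilon\right), \\
|D_x^k D_y^l E_{xy}^-(x,y)| &\le C\varepsilon^{-k-l} \exp\left(-(m_1(0.5 - x) + m_2(1 - y))/\varepsilon\right), \\
|D_x^k D_y^l E_x^+(x,y)| &\le C\varepsilon^{-k} \exp\left(-m_3(1 - x)/\varepsilon\right), \\
|D_x^k D_y^l E_y^+(x,y)| &\le C\varepsilon^{-l} \exp\left(-m_4 y/\varepsilon\right), \\
|D_x^k D_y^l E_{xy}^+(x,y)| &\le C\varepsilon^{-k-l} \exp\left(-(m_3(1 - x) + m_4 y)/\varepsilon\right),
\end{aligned} \qquad (7)$$

where $C$, $m_1$, $m_2$, $m_3$, $m_4$ are positive constants independent of $\varepsilon$.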



2 Numerical Solution 

2.1 Grid and Grid Functions 

It is well known that uniform convergence cannot be achieved on uniform meshes
for singularly perturbed problems. In order to obtain an $\varepsilon$-uniformly convergent
difference scheme we construct a piecewise uniform mesh $\omega$ condensed
near the boundary and the interior layers, see Fig. 1. Denote by $\omega^-$ the mesh in
$\Omega^-$ and by $\omega^+$ the mesh in $\Omega^+$. In $\Omega^-$ we construct a piecewise uniform mesh
(Shishkin's) condensed near the interface $x = 0.5$ and the boundary $y = 1$.

w" = {{x^,yj), Xi = Xi-i + h^, yj = yj-i + h^, 

i=l,.. .,Ni,j = 1, . . . ,Mi, xq = 0, XNj = 0.5, yo = 0, yMi = 1} , 
hf = 2(0.5 - Sf)/Ni, i = 1, . . . , W/2, hf = 2Sf/Ni, z = W/2 + 1, . . . , iVi, 
h^ = 2(0.5 - 5f)/Mi, j = 1, . . . , Mi/2, h^ = 26\/M,, j = Mi/2 + 1, . . . , Mi, 
= min {2£ln A^i/mi, 1/4} , = min {2£lnMi/m2, 1/2} . 

Similarly in we construct a piecewise uniform mesh (Shishkin’s) condensed 
closely to the boundaries x = 1 and y = 0. 
= {{Xi+Ni.Vj), X^+N^ = X,+Ni-l + Kl+N^^ Vj = yj-l + hp'i-= 
j = 1, . . . , M2, XNi = 0.5, XN1+N2 = 1, yo = 0, 2/M2 = 1} , 

= 2(0.5 - S^)/N 2 , 1 = 1 ,..., N 2 / 2 , 
h-+N, = ^5^/N2, i = N2/2 + 1,..., N 2 , 

= 2b\lM2, 2 = 1,..., M 2 / 2 , = 2(1 - 51)/ M 2 , j = M 2/2 + 1, . . . , M 2 , 

^2 = min {2£rln7V2/m3, 1/4} , 5\ = min {2elnM2/m4, 1/2} . 
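
The one-dimensional building block of such a piecewise uniform mesh can be sketched in Python. The sketch assumes the pattern visible in the definitions above: half of the intervals are placed in a layer region of width min{2ε ln N / m, half of the interval length} and the other half in the remainder. The function name and argument names are hypothetical.

```python
import math


def shishkin_mesh(a, b, n, eps, m_const, layer_at_right=True):
    """Nodes of a piecewise uniform (Shishkin) mesh on [a, b] with n intervals.

    Half of the intervals are condensed in a layer of width delta next to b
    (or next to a when layer_at_right is False).
    """
    assert n % 2 == 0, "n must be even"
    delta = min(2.0 * eps * math.log(n) / m_const, (b - a) / 2.0)  # transition parameter
    h_coarse = 2.0 * ((b - a) - delta) / n    # step outside the layer
    h_fine = 2.0 * delta / n                  # step inside the layer
    steps = [h_coarse] * (n // 2) + [h_fine] * (n // 2)
    if not layer_at_right:
        steps.reverse()
    nodes = [a]
    for h in steps:
        nodes.append(nodes[-1] + h)
    nodes[-1] = b                             # remove accumulated rounding error
    return nodes
```

Under these assumptions, `shishkin_mesh(0.0, 0.5, N1, eps, m1)` would give the x-nodes in the left subdomain, condensed towards the interface x = 0.5, and `shishkin_mesh(0.0, 1.0, M2, eps, m4, layer_at_right=False)` the y-nodes in the right subdomain, condensed towards y = 0. When ε is not small the transition parameter reaches its upper bound and the mesh becomes uniform.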

On the interface we consider the union of the projection of each mesh on the 
interface uj~ 

For each point $(x_i, y_j)$ of $\omega$ we consider a control volume, the rectangle $e_{ij} =
e(x_i, y_j)$, see Figs. 1 and 2. There are four types of grid points: boundary points, regular and
irregular interior points, and interface points, see Fig. 1.



Fig. 1. Grid with local refinement on boundary and interior layers. The domain has corners (0,0), (1,0), (1,1) and (0,1), and the interface runs from (0.5,0) to (0.5,1); the legend marks boundary points, regular points in $\omega^-$ and $\omega^+$, irregular points in $\omega^-$ and $\omega^+$, and interface points.



Let v,w,g are given grid functions of a discrete arguments (xi,yj) G uj. 
Denote gtj = g{xi,yj), gNi±o,j = g{xN-i ± Further we shall use the 

standard notations 



UX I XX 

- ^ h, + h.+i 



, 9Ni,j = 



+ + ^Ni9Ni-0,j 



2h 



5 — 



Ni 



Vi,j Vi-Ij 



^i,j 

^x,i,j — 7 "^ ’ — '^x,i-\-l,j ■J '^xx,i,j — 



hf 



and similarly ones for y. Let N = N 1 + N 2 and Mint is the number of intervals on 
the interface. Denote by Mi the number of intervals corresponding to Xi and by 
and so on the corresponding mesh steps with respect to y. Clearly Mi = Mi 
for i = 0, . . . , 7Vi — 1, Mi = M 2 for z = + 1, . . . , iV and Mi = Mint for z = iVi. 
We shall use the following discrete scalar product

N-lMi-l 

{v,w)o,cj='^ hihY-Vi^jWij (9) 

i=i j=i 



and the norms 



|u||o,<^ 




N-l Mi-1 

E E 

i=i j=i 



||w||oo,a; = max \Vij\. 



(10) 



2.2 Finite Difference Approximation 

The finite difference approximation is derived from the balance equation. Inte- 
grating the equations (1) over cell Cij that does not interact the interface F and 
then using Green’s formula and dividing by we get 

+ if{x,y)-b{x,y)u{x,y))dxdy, (11) 

hj/lj’ ddetj Jetj 

where W = — eVit(s).n, V = a{s).nu{s), and n is the unit normal vector to the 
boundary of e^- . 

Let now the rectangle Cij interact the interface F, and are the left 

and the right part of Cij respectively. Denote by Sp,i,j the intersection of Cij 
and F. Using the interface conditions (2), we obtain 



dfdf 



dfdY 



{W + V)ds ■ 



{W + V)ds 



_ de-i\Sr.i.j Jde+j\Sr.i.: 

/ if - bu)dxdy + (/ - bu)dxdy + / Q{y)dy 

J e- - J et ■ J Sr i -i 



■ (12) 



We shall consider three difference schemes: upwind (UDS), modified upwind
(Samarskii's) (MUDS) and Il'in's (IDS), see [6]. Denote by $U$ the solution of the
discrete problem. At the interior points in $\omega$ that do not lie on the interface, we
use the approximations



( 13 ) 



= -eVa (4v^C/)^_^. - £ + Vs 

+ + (p2 -I- bijUij = fij, 

At the interface points we get the approximation 
^ Nx,j V V ■'v,Ni,j 



X 

Ni 

The definition of /, 6, pj’* is given in (8). Here I = 1 stands for UDS, 

/ = 2 for MUDS, / = 3 for IDS and 



OLm.i.i 



I 



m,z j I : ^ 1 

1 = 3, 



r 1 , / = ! 

t (1 + ^ ^ = 2, ~ 

Rm,i,j COth(/?77j,^j j' ) , I — 3, 

_ f I, / = 1, 2, _,/ _ r a2,2,j+i + |a2,2,j+i I, ^ = 1, 2, 

“ \ ai.*+i,j, I = 3, ^2.zj - I a2,*.j+i, Z = 3, 

0^1, 2, j yj^ tt2,2,_7 — ^ 2 (^ 2 ? 



QTl 

'^i,j 



Q22J 



^\i/2 
Fig. 2a. Control volume



Fig. 2b. Irregular grid points 



At the regular points we have 

^ xRi,j Ux,i,ji ^ xRiJ Ux,i,j'! ^ xUi,j Ux,iJ' 

For ease of exposition we shall describe the approximation at the irregular points in the particular situation of Fig. 2b. We shall use the ideas from [2, 4]. Consider the control volume $e_{ij}$ corresponding to the irregular point $(x_i, y_j)$. From the figure we see that it has a common border on the left with more than one cell. We require that the finite difference scheme conserve mass. For example, in the particular situation of Fig. 2b we have

(W + V)dy 



y.-hf;-^l2 



/ {W + V)dy= / 

'S”, Jy,-hYl2. 



ryj+Y+i/'^ 



lyj+hf\Y/^ 



{W + V)dy. (15) 



(VF+ U)dy- 

Taking into account (15) we approximate with a linear combination of 





,c/,_ 


Ui-lji ~ 


1 
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hf - '' 
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Thus we obtain the following expression for V xUij 



^ xUij — 



E 



Hi (C/.,, - 



Jk ^ 



■3k ) 






where rij is the number of neighbour cells (on the left) of the cell and 
is the length of the intersection of and Similarly are defined \7xUij 

and VxUij. 

Setting the boundary conditions

$$U|_{\partial\omega} = g, \qquad (16)$$

we obtain the finite difference problem (FDP) (13), (14), (16). The FDP can be written as a system of linear algebraic equations

$$AU = F, \quad (x_i, y_j) \in \omega, \qquad (17)$$

where the boundary conditions (16) have been taken into account in the right-hand side $F$. Using an argument similar to that in [6], we can prove the following.

Proposition 2 The FDP (17) satisfies the discrete maximum principle and the corresponding matrix A is an M-matrix.
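
The M-matrix property in Proposition 2 can be illustrated on a much simpler model than the two-dimensional interface problem. The sketch below is only an illustration under stated assumptions: it uses the one-dimensional equation $-\varepsilon u'' + a u' + b u = f$ with constant $a > 0$ on a uniform mesh, and the function names are hypothetical. It checks a standard sufficient criterion for an irreducible tridiagonal matrix (positive diagonal, nonpositive off-diagonal entries, weak diagonal dominance that is strict in at least one row). The upwind matrix passes the check for any $\varepsilon > 0$, while a central difference matrix fails it when $\varepsilon$ is small relative to the mesh width.

```python
def convection_diffusion_matrix(n, eps, a, b=0.0, upwind=True):
    """Matrix of a 1D scheme for -eps*u'' + a*u' + b*u = f with a > 0.

    Uniform mesh with n interior nodes on (0, 1), zero Dirichlet data.
    upwind=True uses a backward difference for u', otherwise a central one.
    This 1D model is an assumption made for illustration only.
    """
    h = 1.0 / (n + 1)
    diff = eps / h**2
    if upwind:
        left, diag, right = -diff - a / h, 2.0 * diff + a / h + b, -diff
    else:
        left, diag, right = -diff - a / (2 * h), 2.0 * diff + b, -diff + a / (2 * h)
    rows = []
    for i in range(n):
        row = [0.0] * n
        row[i] = diag
        if i > 0:
            row[i - 1] = left
        if i < n - 1:
            row[i + 1] = right
        rows.append(row)
    return rows


def satisfies_m_matrix_criterion(rows, rel_tol=1e-12):
    """Sufficient test: positive diagonal, nonpositive off-diagonals,
    weak row diagonal dominance, strict dominance in at least one row."""
    strict = False
    for i, row in enumerate(rows):
        off_entries = [v for j, v in enumerate(row) if j != i]
        if row[i] <= 0.0 or any(v > 0.0 for v in off_entries):
            return False
        off = sum(abs(v) for v in off_entries)
        tol = rel_tol * row[i]          # guard against rounding in the comparison
        if row[i] < off - tol:
            return False
        strict = strict or row[i] > off + tol
    return strict
```

With `eps = 1e-6` and 16 interior nodes, the upwind matrix satisfies the criterion and the central one does not, which is the usual motivation for upwind-type schemes in the singularly perturbed regime.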



3 Uniform Convergence 

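The Green's function $G$ of the operator $L^h$ associated with the mesh node $(\xi_k, \eta_l) \in \omega$ is defined by

$$L^h G(x_i, y_j, \xi_k, \eta_l) = \delta^h(x_i, \xi_k)\,\delta^h(y_j, \eta_l) \ \text{on } \omega, \qquad G = 0 \ \text{on } \partial\omega,$$

where $\delta^h(x_i, \xi_k)$ and $\delta^h(y_j, \eta_l)$ are the discrete analogs of the delta function:

$$\delta^h(x_i, \xi_k) = \begin{cases} 1/h^x_i & \text{if } i = k,\\ 0 & \text{otherwise,}\end{cases} \qquad \delta^h(y_j, \eta_l) = \begin{cases} 1/h^y_j & \text{if } y_j = \eta_l,\\ 0 & \text{otherwise.}\end{cases}$$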



As a function of the 'second' argument, $G$ satisfies, for any $(x_i, y_j) \in \omega$,

$$L^{h*} G(x_i, y_j, \xi_k, \eta_l) = \delta^h(x_i, \xi_k)\,\delta^h(y_j, \eta_l) \ \text{on } \omega, \qquad G = 0 \ \text{on } \partial\omega,$$

where $L^{h*}$ is the adjoint operator of $L^h$. Using Proposition 2, similarly to [1], we can prove the following.

Proposition 3 The discrete Green's function is nonnegative and satisfies

Mk-l Ni 

max G{x^,yj,^k,m)h(''" < max V G(a:*, 

1 — 1 k—1 

N-1 

^ ^ < CX2 ■ 

k^Ni 



max 



(18) 
Proposition 3 yields the following a priori bounds.

Proposition 4 Let $L^h v = f_0 + f_1 + f_2$ with arbitrary mesh functions $f_0$, $f_1$, $f_2$, and let $v$ satisfy zero boundary conditions. Then



i=l 

^int — 1 



N-l Mi-1 

'■ ■■ 



M 2 — 1 



^ ^ ■’ ^ ^ 2 — 1 

t=i i=i 

-I- min{o^^, 



(19) 



Now using Proposition 4, similarly to [8], we can prove the following theorem:

Theorem 1. Let $u$ be a solution of the differential problem (1)-(4) which satisfies Proposition 1. Let $U$ be a solution of the discrete problem (17). Then, if the scheme is UDS, the following $\varepsilon$-uniform estimate holds:

$$\|U - u\|_{\infty,\omega} \le C\left(N^{-1}\ln N + M^{-1}\ln M\right). \qquad (20)$$

If the scheme is MUDS or IDS, then for sufficiently large $M$ and $N$, independent of $\varepsilon$, the following estimate holds:

$$\|U - u\|_{\infty,\omega} \le C\left(N^{-1} + M^{-1}\right), \qquad (21)$$

where $M = \max\{M_1, M_2, M_{int}\}$ and $C$ is a positive constant independent of the mesh and the small parameter $\varepsilon$.



4 Numerical Results 

Consider the problem (1)-(4) with boundary conditions $g(x,y) = 1$ and coefficients

$$a_1(x,y) = 1 + 2x, \quad a_2(x,y) = 1 + y, \quad q(x,y) = 1 + xy, \quad (x,y) \in D^-,$$
$$a_1(x,y) = 2 + x, \quad a_2(x,y) = y - 2, \quad q(x,y) = 2 - xy, \quad (x,y) \in D^+,$$
where the right hand side f{x, y) and the interface function Q{y) are chosen such 
that 



u = 



1 I [ l-exp(-(l-2a;)/E) 
I l-exp(-l/e) 

1 I / l-exp(-3(l-a;)/E) 
1 — exp(— 3/2£) 



i»"+. 



is the exact solution. This function exhibits typical boundary and interface layer 
behaviour. For our tests we take N = Mi = M 2 and e = 10“®, which is a 
sufficiently small choice to bring out the singularly perturbed nature of the 
problem. 
Table 1 displays the results of our numerical experiments. We observe first-order convergence for IDS and MUDS, while for UDS we observe convergence that is slower than first order. The convergence rate is taken to be

$$p_N = \log_2\left(\|E_N\|_{\infty,\omega} / \|E_{2N}\|_{\infty,\omega}\right),$$

where $\|E_N\|_{\infty,\omega}$ is the maximum norm error for the corresponding value of $N$.
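
As a check on this definition, the rates can be recomputed from the errors reported in Table 1. The sketch below does that and, for comparison, evaluates the rate that an error behaving exactly like $N^{-1}\ln N$ (the form of the UDS bound (20)) would produce. The function and variable names are illustrative choices; the error values are the ones listed in Table 1.

```python
import math

# Maximum norm errors from Table 1 for N = 8, 16, 32, 64, 128, 256.
errors = {
    "UDS":  [2.9349e-1, 2.0149e-1, 1.1409e-1, 6.4486e-2, 3.6850e-2, 2.0723e-2],
    "MUDS": [2.9325e-1, 2.0121e-1, 1.1399e-1, 6.0059e-2, 3.0695e-2, 1.5492e-2],
    "IDS":  [2.8443e-1, 1.9590e-1, 1.1173e-1, 5.9289e-2, 3.0468e-2, 1.5416e-2],
}


def observed_rates(errs):
    """p_N = log2(E_N / E_2N) for each consecutive pair of meshes."""
    return [math.log2(e_n / e_2n) for e_n, e_2n in zip(errs, errs[1:])]


def log_bound_rate(n):
    """Rate obtained if the error were exactly proportional to ln(n) / n."""
    return math.log2((math.log(n) / n) / (math.log(2 * n) / (2 * n)))


for name, errs in errors.items():
    print(name, [round(p, 2) for p in observed_rates(errs)])
print("ln(N)/N model:", [round(log_bound_rate(n), 2) for n in (8, 16, 32, 64, 128)])
```

The recomputed values agree with the $p_N$ columns of Table 1 to two decimals, and the UDS rates stay below one in the same way as the $\ln N / N$ model.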



Table 1. Error on Shishkin mesh

  N    UDS        p_N    MUDS       p_N    IDS        p_N
  8    2.9349e-1  0.54   2.9325e-1  0.54   2.8443e-1  0.54
 16    2.0149e-1  0.82   2.0121e-1  0.82   1.9590e-1  0.81
 32    1.1409e-1  0.82   1.1399e-1  0.92   1.1173e-1  0.91
 64    6.4486e-2  0.81   6.0059e-2  0.97   5.9289e-2  0.96
128    3.6850e-2  0.83   3.0695e-2  0.99   3.0468e-2  0.98
256    2.0723e-2  -      1.5492e-2  -      1.5416e-2  -



5 Conclusions 

The paper presented three FVM difference schemes: UDS, MUDS and IDS. The schemes were investigated with regard to their accuracy and uniform convergence on nonconforming Shishkin meshes. Concerning the rate of convergence, MUDS and IDS are first-order $\varepsilon$-uniformly convergent, while UDS is slower. Thus the theoretical and experimental results show that Shishkin meshes resolve the boundary and interface layers even when the meshes are nonconforming near the interface.
Abstract. In the present paper we consider a parameter identification problem associated with systems governed by nonlinear parabolic partial differential equations with delays. To be precise, a distributed parameter model describing spatially-dependent hepatic processing of the chemical compound named dioxin or TCDD is analysed. The associated inverse problem is formulated in an operator theoretic setting for a least squares optimization problem. Galerkin type approximations are used to define a family of approximate optimization problems. Finally, the convergence result of the parameter identification problem is tested numerically, using both exact and noisy data.



1 Introduction 

Recently, numerous physiologically-based pharmacokinetic models have been de- 
veloped to describe the uptake and elimination of the environmental toxin named 
dioxin or TCDD (see, e.g. [1, 4, 5, 8, 9]). 

In this paper we are concerned with a parameter identification problem associated with a slightly modified version of the TCDD model proposed in [5, 8]. The model under investigation is presented in Section 2, together with the well-posedness result. In Section 3 we define the parameter estimation problem and present the convergence result for Galerkin approximations. Section 4 contains the numerical scheme used in solving the parameter identification problem, together with numerical results. Finally, in Section 5 we present concluding remarks.



2 The Mathematical Model. Well-Posedness 

Recently, a mathematical model has been developed in [5, 8] to describe pharmacokinetic and pharmacodynamic properties of TCDD. It consists of a nonlinear system of partial differential equations with delays. The convection-dispersion equation (1), presented below, is based on the work of Roberts and Rowland ([9]) and characterizes the transport of blood elements in the liver sinusoidal (blood) region. The dimensionless spatial variable $x$ takes on values in the range [0, 1]; $x = 0$ corresponds to the liver inlet, while $x = 1$ corresponds to the outlet. The model describes the dynamics of TCDD binding with two intracellular hepatic proteins, the Ah receptor (3)-(4) and an inducible microsomal protein, CYP1A2 (5)-(6). The induction mechanism is described in terms of the fractional occupancy of the Ah receptor at a previous time, $t - \tau_r$. Here, $\tau_r$ denotes a lag time representing the induction delay. In our modified version of the model, elimination in the liver (by metabolism and biliary clearance) is assumed to be a Michaelis-Menten process, and the axial dispersion number, $D_N$, is now considered a spatially distributed parameter, $D_N(x)$.

I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 438-447, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004


The combined venous/arterial blood compartment, including a loss due to 
the uptake and elimination of TCDD in the rest of the body, is given by (7). 
A circulatory lag, Tc, accounts for the time delay in transport of blood elements 
from the exit of the liver to the venous measurement location. 

The mathematical system under consideration is defined as follows: 



(vb + Vd^) 

dCun 

dt 



_ 



fuB Cb), 
P 



Q 



OCb 

dx 



Pfu 



■Cb- 



I CuH 3 AhiC u H T Ah) 



Vh 

-\-k-iCAh-T — gPr{CuHJ Cpr) + k-2pPr-T , 



dpAh-T 

m 

OCah 

dt 

dC p, — T 

Ft 

dCpr 

dt 




9Ah{CuH: pAh) — k-iPAh-T , 



k-lpAh-T — gAh(PuH ; pAh) ~ kd(Ah)pAh + ks{Ah) , 
gPriCuH J Ppr) — k-2pPr-T , 

k-2pPr-T — gPr{PuHJpPr) ~ kd(Pr)Ppr + kg(p^) 

PAh-T{t -Tr,x) 

pAh{t - Tr,x) + PAh-T{t ~ Tr , x) ’ 

^ 1) - Pa{t)) + m - gm{Pa{t)) , 

* a 



( 1 ) 

( 2 ) 

( 3 ) 

( 4 ) 

( 5 ) 

( 6 ) 
( 7 ) 



PB{t,0) = Pa{t) , 

QCB{t,l)-QVN{l)^{t,l)=q2{t), (8) 

C(s, x) = <P{x) , S € [-Tr, 0] , 

where $C = [C_B, C_{uH}, C_{Ah-T}, C_{Ah}, C_{Pr-T}, C_{Pr}, C_a]^T$. In equation (8), $\Phi$ and $q_2$ are assumed known.

The functions $g_{Ah}$, $g_{Pr}$ and $g_m$ in equations (2)-(7) are the nonlinearities defined by $g_{Ah}(u,v) = k_{+1}uv$, $g_{Pr}(u,v) = k_{+2}uv$ and $g_m(u) = V_m u/(k_m + u)$ for $u, v \in \mathbb{R}$. The first two arise from the law of mass action in chemical kinetics, and the third is the saturating Michaelis-Menten term. Tables 2 and 3 contain descriptions for rate constants as well as physiological and biological parameters of the system.
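
The nonlinear terms are simple enough to write down directly. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the Michaelis-Menten term uses the standard form $V_m u/(k_m + u)$ with the parameters $V_m$ and $k_m$ listed in the nomenclature tables, and `fractional_occupancy` is the ratio of bound to total Ah receptor on which the induction mechanism is based. Parameter values in any call are placeholders, not values from the paper.

```python
def g_ah(u, v, k_plus1):
    """Mass action binding of unbound TCDD (u) to free Ah receptor (v)."""
    return k_plus1 * u * v


def g_pr(u, v, k_plus2):
    """Mass action binding of unbound TCDD (u) to free CYP1A2 (v)."""
    return k_plus2 * u * v


def g_m(u, v_max, k_m):
    """Michaelis-Menten elimination: close to (v_max / k_m) * u for small u,
    saturating at v_max for large u."""
    return v_max * u / (k_m + u)


def fractional_occupancy(c_ah_t, c_ah):
    """Share of Ah receptor that is bound to TCDD, evaluated at a delayed time."""
    return c_ah_t / (c_ah + c_ah_t)
```

For instance, `g_m(k_m, v_max, k_m)` returns half of `v_max`, which is the defining property of the Michaelis constant.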

The result of existence, uniqueness, and continuous dependence for systems 
including the model above is summarized below, and was obtained using a weak 
or variational formulation of the problem statement (see, e.g. [3,5,8]). 

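We consider the state spaces $\mathcal{V} = V \times H^5 \times \mathbb{R}$ and $\mathcal{H} = H^6 \times \mathbb{R}$, where $V = H^1_L(0,1)$ and $H = L^2(0,1)$. We define $H^1_L(0,1) = \{\varphi \in H^1(0,1) : \varphi(0) = 0\}$, with $V$-norm $|\varphi|_V = |\varphi'|_H$ for all $\varphi \in V$. The inner product in $\mathcal{H}$ is defined by

$$\langle \varphi, \psi \rangle_{\mathcal{H}} = \sum_{i=1}^{6} \langle \varphi_i, \psi_i \rangle + \varphi_7 \psi_7 \quad \text{for all } \varphi, \psi \in \mathcal{H},$$

where $\langle \cdot, \cdot \rangle$ denotes the usual $L^2(0,1)$ inner product.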

We multiply the equation in (l)-(8) by a function (pi in a “suitable” class 
of test functions and integrate in space in the first six equations, followed by 
integration by parts in the first equation (1) only. 

The weak or variational formulation of the problem (l)-(8) is as follows: we 
seek a solution z{t) G V satisfying an appropriate initial condition z{0) and 



for all If G V, where f{t) = (-/(t) - a5q2{t)6i,0,0, ks(Ah),0,ks(Pr), , 

g{z) = { 0 , 9 Ah {Z2 ,Z 4 )+ gpr {Z2 ,Z^),~gAh (Z2 ,Z 4 ), gAh {Z2 ,Z 4 ), ~gPr {Z2 ,Ze), 
gpr{z2,Z(i), gm(z7))’^, gp^z) = (0, 0, 0, 0, 0, -Ipr Z4 ^ sesquilin- 

ear forms cr, ctd : V x V* — *■ fR are defined as 

cr('0, ‘p) = - O2'01, + (a4'0i ~ a3'02 + («4 - Pi) 

-{a2ip7Si,pi)vGV + {k-llp3,P3) 

+ {b2i>2 - + k^iips + fc_2i/’5 + bii>r),P2) 

+ {kd{Ah)i’4 - fc-l'03,<P4) 

+ {k-2'tp5, Pb) + {kd(Pr)'^6 - k-2lpb, Pd) + aill7P7 
aD{ip,p) = a [{i{ji{l),pi) + {ip7,Pi) - V'i(l)</^7 - '^PiPr] ■ 

Here, $\varphi' = d\varphi/dx$, and $a, a_1, \ldots, a_5, b_1, b_2$ are constants depending on biological parameters of the system. We notice that the forcing term $F(t) = \langle f(t), \varphi \rangle_{\mathcal{V}^*,\mathcal{V}}$ implies the pointwise evaluation of a test function at the right boundary ($x = 1$), through the presence of a Dirac delta function, $\delta_1$, in the first component of $f(t)$.

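The variational formulation of the problem (1)-(8) can be written in operator theoretic form as follows: we seek a solution $z(t) \in \mathcal{V}$ satisfying in $\mathcal{V}^*$ the system

$$\dot z(t) + A z(t) + g(z(t)) + A_D z(t - \tau_c) + g_D(z(t - \tau_r)) = F(t), \qquad z(s) = z_0(s), \ s \in [-\tau_r, 0]. \qquad (10)$$

This is the operator form of the weak equation

$$\langle \dot z(t), \varphi \rangle_{\mathcal{V}^*,\mathcal{V}} + \sigma(z(t), \varphi) + \sigma_D(z(t - \tau_c), \varphi) + \langle g(z(t)), \varphi \rangle_{\mathcal{H}} + \langle g_D(z(t - \tau_r)), \varphi \rangle_{\mathcal{H}} = \langle f(t), \varphi \rangle_{\mathcal{V}^*,\mathcal{V}}. \qquad (9)$$
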
Theorem 1. ([3]) Under certain mild assumptions on the problem data (see [3]), there exists a weak solution $z$ of (10) with $z \in L^2((0,T), \mathcal{V}) \cap C([0,T], \mathcal{H})$ and $\dot z \in L^2((0,T), \mathcal{V}^*)$. Furthermore, the solution is unique and depends continuously on the data $(z_0, F)$.

This result was obtained by first establishing well-posedness for the system

$$\dot z(t) + A z(t) + g(z(t)) = F(t) \ \text{in } \mathcal{V}^*, \qquad z(0) = z_0, \qquad (11)$$

and then showing that it could be extended to delay systems of the form (10) by the method of steps used in the study of delay differential equations.

3 Parameter Estimation Problem and Convergence 
of Galerkin Approximations 

Based on the theoretical development described in Section 2, and since the model 
(l)-(8) is a special case of (10), which can be investigated via (11), we consider 
the following class of abstract nonlinear parameter dependent parabolic systems 
evolving in a real separable Hilbert space: 

$$\dot z(t) + A(q) z(t) + g(q)(z(t)) = F(t; q), \qquad z(0) = z_0. \qquad (12)$$

In this case the unbounded operator A, the nonlinear operator g, and the forcing 
term F are all assumed to be dependent on some parameter q, in contrast to the 
“parameter free” representation given in (11). 

The results presented here are based on the general parameter estimation 
formulation and results of Banks and Kunisch ([2]). In the estimation formulation 
of the system in (11), we consider (12) with the linear operator A, the nonlinear 
term g, and the input F parameterized by a vector parameter q that must 
be estimated from experimental data. In this case q takes on values from an 
admissible parameter set Q, which may be an infinite dimensional set. 

Parameter Estimation Problem. The parameter identification problem can be defined as follows: find $q \in Q$ which minimizes

$$J(q, w) = \sum_{i=1}^{K} \left| \mathcal{C} z(t_i, \cdot\,; q) - w_i \right|^2, \qquad (13)$$

where $z(t_i, \cdot\,; q)$ are the parameter dependent solutions of (12) evaluated at time $t_i$, $\mathcal{C}$ is the observation operator, and $w = \{w_i\}_{i=1}^{K}$ corresponds to measurements taken at times $t_i$ ([6]). We consider Galerkin type approximations to (12) and define a family of approximating parameter estimation problems.

This minimization procedure involves an infinite dimensional state space $\mathcal{H}$ and an admissible parameter set $Q$. The parameter space is generally infinite dimensional, as certain of the unknown parameters (for example, the exit flux $q_2(t)$) involve spatial and/or temporal dependence. We proceed as in the works of Banks and Kunisch ([2]). Let $\mathcal{H}^N$ be a finite dimensional subspace of $\mathcal{H}$ and $Q^M$ be a sequence of finite dimensional sets approximating $Q$. We can then formulate a family of finite dimensional estimation problems, with finite dimensional state spaces and finite dimensional parameter sets, as follows: find $q \in Q^M$ which minimizes

$$J^N(q, w) = \sum_{i=1}^{K} \left| \mathcal{C} z^N(t_i; q) - w_i \right|^2, \qquad (14)$$

where $z^N(t; q) \in \mathcal{H}^N$ is the solution to the finite dimensional approximation of (12) given by

$$\langle \dot z^N, \varphi \rangle_{\mathcal{V}^*,\mathcal{V}} + \langle A(q) z^N, \varphi \rangle_{\mathcal{V}^*,\mathcal{V}} + \langle g(q)(z^N), \varphi \rangle = \langle F(t; q), \varphi \rangle_{\mathcal{V}^*,\mathcal{V}}, \qquad z^N(0) = P^N z_0, \qquad (15)$$

for all $\varphi \in \mathcal{H}^N$, where $P^N$ is the orthogonal projection of $\mathcal{H}$ onto $\mathcal{H}^N$.

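In the following, we assume that the conditions of Theorem 1 are satisfied, along with approximating assumptions for the finite dimensional subspaces $\mathcal{H}^N$ and $Q^M$ (see [4] for details). The convergence result for the approximate parameter identification problems, which entail minimization of (14), is given by the following theorem:

Theorem 2. ([4]) Let $q^N$ be any sequence in $Q^M$ such that $q^N \to q \in Q$. Then $z^N(t; q^N) \to z(t; q)$ in $\mathcal{H}$ uniformly on $[0,T]$, and $z^N(t; q^N) \to z(t; q)$ in $\mathcal{V}$ for almost all $t > 0$, where $z^N$ satisfies

$$\langle \dot z^N(t), \varphi \rangle_{\mathcal{V}^*,\mathcal{V}} + \sigma(q^N)(z^N(t), \varphi) + \langle g(q^N)(z^N(t)), \varphi \rangle = \langle F(t; q^N), \varphi \rangle_{\mathcal{V}^*,\mathcal{V}}, \qquad z^N(0) = P^N z_0,$$

for all $\varphi \in \mathcal{H}^N$, and $z$ satisfies, for all $\varphi \in \mathcal{V}$,

$$\langle \dot z(t), \varphi \rangle_{\mathcal{V}^*,\mathcal{V}} + \sigma(q)(z(t), \varphi) + \langle g(q)(z(t)), \varphi \rangle = \langle F(t; q), \varphi \rangle_{\mathcal{V}^*,\mathcal{V}}.$$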

For the model presented in Section 2, each of the parameters lies in some compact subset of Euclidean space. We state without proof that the conditions in Theorem 2 hold for the TCDD model. This statement follows from proofs and arguments given in [4, 8]. For the state spaces $\mathcal{V}$ and $\mathcal{H}$, the approximating conditions are satisfied for the choice of piecewise linear continuous basis elements for the finite dimensional subspaces $\mathcal{H}^N$.

A solution to the identification problem is guaranteed over each delay interval of length $\tau_c$ and is continued forward in time by the method of steps ([4, 8]).

4 Implementation and Numerical Results 

We discuss here our numerical approach. First, we present the method used in 
obtaining a numerical solution for the initial-boundary- value problem (l)-(8). 
The values for the system parameters are as given in [1,4]. The numerical scheme 
is based on the weak formulation of the problem (l)-(8) described in Section 2. 

Finite Element Formulation. Let $0 = x_0 < x_1 < \ldots < x_N = 1$ be a uniform partition of the interval [0, 1] into $N$ subintervals of length $h = 1/N$. We take as basis elements the standard piecewise linear continuous functions, $\varphi_j$, $j = 0, \ldots, N$, defined by

$$\varphi_j(x) = \begin{cases} \dfrac{x - x_{j-1}}{h}, & x_{j-1} \le x \le x_j, \\[4pt] \dfrac{x_{j+1} - x}{h}, & x_j \le x \le x_{j+1}, \\[4pt] 0, & 0 \le x \le x_{j-1} \ \text{or} \ x_{j+1} \le x \le 1. \end{cases}$$




We define $z = [C_B, C_{uH}, C_{Ah-T}, C_{Ah}, C_{Pr-T}, C_{Pr}, C_a]^T = [z_1, \ldots, z_7]^T$ and the Galerkin finite element approximation by

$$z_1^N(t,x) = \sum_{j=1}^{N} \alpha_j^1(t)\,\varphi_j(x), \qquad z_i^N(t,x) = \sum_{j=0}^{N} \alpha_j^i(t)\,\varphi_j(x), \quad \text{for } i = 2, \ldots, 6, \qquad (17)$$

where the $\varphi_j$ are as described above and $z_i^N$ is the approximation for $z_i$.
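
The basis functions and the expansion (17) can be sketched in a few lines. This is an illustration only, with hypothetical function names, on the uniform partition described above; the closed form `max(0, 1 - |x - x_j| / h)` is equivalent to the piecewise definition of the basis elements.

```python
def hat(j, x, n):
    """Piecewise linear basis function phi_j on a uniform partition of [0, 1]
    into n subintervals: equal to 1 at x_j = j / n and 0 at every other node."""
    h = 1.0 / n
    return max(0.0, 1.0 - abs(x - j * h) / h)


def galerkin_eval(coeffs, x):
    """Evaluate sum_j coeffs[j] * phi_j(x), with coeffs indexed by j = 0..n."""
    n = len(coeffs) - 1
    return sum(c * hat(j, x, n) for j, c in enumerate(coeffs))
```

For the first component in (17), whose sum starts at $j = 1$, the same routine applies with `coeffs[0]` set to zero, which enforces the zero value at the inlet $x = 0$. Because the basis functions interpolate nodal values, passing the nodal values of a linear function reproduces that function exactly.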

Next we substitute the finite element approximation (17) into the weak form 
of the equations. Let z^(t) G jf^JV+5(N+i)+i ^^(t) = [a^(t), 

. . a^(t), Z 7 (t)]^. The finite dimensional system we obtain, in terms of the 
time-dependent coefficients of the Galerkin approximations, is given by 

M^z^(t) = A^z^it) + g^(z^(t)) + A^(t) 

+ABZ-^(t-Tc)+gf(z-^(t-Tr)), (18) 

where the matrices and A^ are elements of jfj6(Ar+i)x6(Af+i) vector- 
valued functions Gp , , Ap, and are elements of 

The initial condition, z^ , for the semi-discrete problem (18) is taken as the 
projection of the original initial condition, <?, onto the finite element space. 

General Algorithm for the Forward Problem. The TCDD model includes two time delays, $\tau_c$ and $\tau_r$, with $\tau_c \ll \tau_r$. We set $\tau_c = 1$ minute for the circulatory delay and $\tau_r = 6$ hours for the induction delay. We successively performed numerical tests using the axial dispersion parameter, $D_N$, as a spatially distributed
one, given by T>i^{x) = 10-1- 10^x(l — x) and Vn{x) = 10-1- 10^(x — 0.5)^. 

To see the effects of the protein induction delay, $\tau_r$, we compute solutions on the interval from zero to 24 hours. We assume the absence of TCDD in the system on the interval $[-6, 0]$ hours. Thus, one can ignore the induction delay term, $g_D$, over the first induction delay interval $[0, \tau_r]$.

The entries in the matrices $M^N$ and $A^N$, consisting of certain combinations of inner products of basis elements and their derivatives, were calculated analytically, as well as the nonlinear vector function $g^N$. The nonlinear vector function $g_D^N$ was calculated using the numerical integration scheme quad in Matlab.
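
For piecewise linear basis functions on a uniform mesh, the inner products mentioned here have well-known closed forms. The sketch below assembles the two elementary building blocks, the $L^2$ inner products $\langle \varphi_i, \varphi_j \rangle$ and the derivative inner products $\langle \varphi_i', \varphi_j' \rangle$, for a single scalar component. It is an assumption-laden illustration with hypothetical function names: the actual matrices of the scheme combine such blocks with the model constants, which are not reproduced here.

```python
def mass_block(n):
    """Entries <phi_i, phi_j> for hat functions on n uniform subintervals of [0, 1]."""
    h = 1.0 / n
    m = [[0.0] * (n + 1) for _ in range(n + 1)]
    for j in range(n + 1):
        # Boundary hats have half the support of interior hats.
        m[j][j] = h / 3.0 if j in (0, n) else 2.0 * h / 3.0
        if j < n:
            m[j][j + 1] = m[j + 1][j] = h / 6.0
    return m


def stiffness_block(n):
    """Entries <phi_i', phi_j'> for the same basis."""
    h = 1.0 / n
    k = [[0.0] * (n + 1) for _ in range(n + 1)]
    for j in range(n + 1):
        k[j][j] = 1.0 / h if j in (0, n) else 2.0 / h
        if j < n:
            k[j][j + 1] = k[j + 1][j] = -1.0 / h
    return k
```

Two quick sanity checks follow from the partition of unity property of the basis: all entries of the mass block sum to the length of the interval, and every row of the stiffness block sums to zero.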

The 'method of steps' for ODEs was used to computationally solve the problem (18) on successive circulatory delay intervals of length $\tau_c$, and again on successive induction delay intervals of length $\tau_r$. We first find the solution over the first delay interval $[0, \tau_c]$, where there is no protein induction and we can ignore the induction delay term, $g_D^N$. Then we find the solution over each subsequent induction delay interval; on these intervals, the protein induction has begun and we must include $g_D^N$. In order to evaluate $z^N(t - \tau_c)$ and $z^N(t - \tau_r)$ at a particular time $t$, we store the computed solution throughout the previous delay intervals and then interpolate to find the value of the solution at times $t - \tau_c$ and $t - \tau_r$. Here, for ease of computation, we have assumed that the final time $T$ is an integer multiple of the induction delay $\tau_r$ and that the induction delay is an integer multiple of the circulatory delay $\tau_c$.



Algorithm (method of steps)

1. Set T = final time (T < tau_r)
2. Form the coefficient matrices M^N and A^N
3. Solve the linear system defining the initial condition z_0^N in the finite element space
4. Set t_0 = 0 and t_f = tau_c if tau_c < T, t_f = T otherwise
5. do while t_0 < T
   - Solve on [t_0, t_f]
       M^N dz^N/dt (t) = A^N z^N(t) + G^N(z^N(t)) + F^N(t) + A_D^N z_0^N(t - tau_c)
       z^N(t_0) = z_0^N(t_0)
   - Set z_0^N(t) = z^N(t)
   - t_0 = t_0 + tau_c
   - t_f = t_0 + tau_c if t_0 + tau_c < T, t_f = T otherwise
6. end
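
The loop above can be illustrated on a scalar delay equation. The sketch below is a simplified stand-in, not the solver used in the paper: it treats a single delay $\tau$, uses Heun's method with a fixed step in place of a variable step stiff integrator, and looks up delayed values by linear interpolation in the stored solution. All names are hypothetical.

```python
from bisect import bisect_right


def solve_by_method_of_steps(f, history, tau, n_intervals, steps=100):
    """Integrate y'(t) = f(t, y(t), y(t - tau)) on [0, n_intervals * tau].

    history(t) supplies y for t <= 0. Returns the lists of times and values.
    """
    h = tau / steps
    ts, ys = [0.0], [history(0.0)]

    def delayed(t):
        # Values at or before t = 0 come from the prescribed history.
        if t <= 0.0 or len(ts) < 2:
            return history(min(t, 0.0))
        # Otherwise interpolate linearly in the solution stored so far.
        k = min(bisect_right(ts, t), len(ts) - 1)
        w = (t - ts[k - 1]) / (ts[k] - ts[k - 1])
        return (1.0 - w) * ys[k - 1] + w * ys[k]

    for m in range(n_intervals):       # one pass per delay interval
        for k in range(steps):         # Heun steps inside the interval
            t, y = ts[-1], ys[-1]
            t_next = (m * steps + k + 1) * h
            k1 = f(t, y, delayed(t - tau))
            k2 = f(t_next, y + h * k1, delayed(t_next - tau))
            ts.append(t_next)
            ys.append(y + 0.5 * h * (k1 + k2))
    return ts, ys
```

A convenient check is $y'(t) = -y(t - 1)$ with $y = 1$ for $t \le 0$. On $[0, 1]$ the delayed term is the known history, so $y(t) = 1 - t$; on $[1, 2]$ the delayed term is the solution just computed, which gives $y(t) = t^2/2 - 2t + 3/2$ and $y(2) = -1/2$. Each interval only needs the solution from the interval before it, which is the point of the method.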



To calculate the solution for the forward problem, we also used the Matlab code dde23, written by L.F. Shampine and S. Thompson (see [10]), a solver capable of finding solutions for a large class of delay differential equations.

Implementation Details for the Forward Problem. The computer code was written in Matlab version 6.1 and computations were carried out on a Pentium III personal computer. The Matlab routine ode15s, which is a variable order, variable step method based on numerical differentiation formulas, was used for time stepping. The relative and absolute error tolerances were set to
1 X 10“®. Since this is a variable step method, at time t the solution over the 
previous delay interval had to be interpolated in order to determine the value of the solution at times $t - \tau_c$ and $t - \tau_r$. To find the value of the solution at time $t - \tau_c$, Matlab's interpolation routine interp1 was used with the spline option. The maximum order of the integrator was set to two in order for it to match the range of accuracy of both the interpolation routine and the finite element approximations. The value of the solution at time $t - \tau_r$ was computed using linear interpolation.

We denote by $I_{Pr}$ the maximum induction rate of CYP1A2. In order to amplify the effects of the induction delay in the computed solution, we set $I_{Pr}$ equal to 434.6 nmol/L/hr, which is ten times the rate suggested by the literature ([1]).

Inverse Problem. We estimated $I_{Pr}$, the value of the maximum rate of synthesis of CYP1A2 in the presence of TCDD, holding all the other parameter values fixed. Since we do not have access to experimental data, we used the solution for the arterial blood concentration $C_a$ from the forward problem simulations as our data. The observation operator $\mathcal{C}$ is the dot product of the solution at time $t$ with the unit row vector that has a one in the last component.

Our goal is to find the value for $I_{Pr}$ that minimizes the difference (in the least squares sense) between $C_a(t, I_{Pr})$, the $I_{Pr}$-dependent solution from the forward problem, and our data. We let $C_a(t_i, I_{Pr})$ be the simulated solution of $C_a$ at time $t_i$, with $t_i$ in the time interval from zero to three hours. We denote the vector of data points or observations by $\hat{C}_a$. Finally, $C_a^*$ is the simulated solution of $C_a$ with $I_{Pr} = I_{Pr}^* = 430$; i.e., we take $I_{Pr}^* = 430$ as the true value in our simulated data.



Table 1. Numerical results for data without noise and with uniformly (u), or normally (n) distributed noise

Noise    Initial            Converged    Converged Value     True         Cost
Level    Simplex x0         Value I_Pr   I_Pr with dde23     Value I_Pr*  Function J
-        (420.5, 420.51)    430.8427     430.9992            430          6.4332 x 10^-?
-        (430.5, 430.51)    430.0722     430.0018            430          6.7289 x 10^-?
-        (440.5, 440.51)    430.0493     430.0025            430          6.3862 x 10^-?
1% (n)   (420.5, 420.51)    430.8263     430.9991            430          4.6621 x 10^-?
1% (n)   (430.5, 430.51)    430.9311     430.9144            430          4.8462 x 10^-?
1% (u)   (440.5, 440.51)    430.6913     430.8994            430          4.3847 x 10^-?
5% (u)   (420.5, 420.51)    430.2817     430.0135            430          1.4836 x 10^-?
5% (u)   (430.5, 430.51)    430.4145     430.2418            430          1.4829 x 10^-?
5% (u)   (440.5, 440.51)    430.3849     430.1634            430          1.4852 x 10^-?
1% (n)   (420.5, 420.51)    430.8425     430.9273            430          1.6394 x 10^-?
1% (n)   (430.5, 430.51)    430.6639     430.8994            430          1.3823 x 10^-?
1% (n)   (440.5, 440.51)    430.7411     430.8891            430          1.6305 x 10^-?
5% (n)   (420.5, 420.51)    430.6241     430.9812            430          4.9483 x 10^-?
5% (n)   (430.5, 430.51)    430.0293     430.0135            430          4.8305 x 10^-?
5% (n)   (440.5, 440.51)    430.1069     430.0481            430          4.9361 x 10^-?


The algorithm for the inverse problem is quite standard. It consists of defining a least squares cost function and choosing an initial value (or values) for $I_{Pr}$, the parameter to be estimated, which are then used in an optimization routine. Here we used a Nelder-Mead optimization routine ([7]). Since Nelder-Mead is a simplex search algorithm, we have chosen as the initial simplex for $I_{Pr}$ an ordered pair of initial values (see Table 1). The feasibility of the algorithm and of the inverse problem was tested with both noise-free data and data containing 1% and 5% levels of noise. In the Nelder-Mead simplex search algorithm for the optimization, we set
Table 2. Nomenclature - Rate constants and lag times

Abbr.      Description
k_{+1}     association rate constant of TCDD and Ah receptor
k_{+2}     association rate constant of TCDD and CYP1A2
k_{-1}     dissociation rate constant of Ah receptor-TCDD complex
k_{-2}     dissociation rate constant of CYP1A2-TCDD complex
k_3        apparent first-order metabolic clearance of TCDD (/hr)
k_{d(Ah)}  rate constant for thermal inactivation of Ah receptor protein (/hr)
k_{d(Pr)}  rate constant for degradation of CYP1A2 (/hr)
k_m        Michaelis constant (nmol/L)
k_{s(Ah)}  rate constant for synthesis of Ah receptor protein (nmol/L/hr)
k_{s(Pr)}  rate constant for basal synthesis of CYP1A2 (nmol/L/hr)
tau_c      circulation delay (hr)
tau_r      lag time between TCDD binding to Ah receptor and cellular response of CYP1A2 induction (hr)
x          dimensionless spatial variable (x = 0 corresponds to the inlet, x = 1 to the outlet)


Table 3. Nomenclature - Physiological and biological parameters

Abbr.     Description
q_2(t)    exit boundary condition term resulting from flux balance (nmol/hr)
C_a       total arterial and venous blood TCDD concentration (nmol/L)
C_Ah      concentration of available Ah receptor protein in hepatocytes (nmol/L)
C_Ah-T    concentration of Ah receptor-TCDD complex in hepatocytes (nmol/L)
C_B       total concentration of TCDD in liver blood (nmol/L)
C_Pr      concentration of available CYP1A2 in hepatocytes (nmol/L)
C_Pr-T    concentration of CYP1A2-TCDD complex in hepatocytes (nmol/L)
C_uB      concentration of unbound TCDD in liver blood (nmol/L)
C_uH      concentration of unbound TCDD in hepatocytes (nmol/L)
D_N       axial dispersion parameter
f_uB      fraction of TCDD unbound in the blood
f_uD      fraction of TCDD unbound in space of Disse
I(t)      input concentration of TCDD at time t (nmol/L/hr)
I_Pr      maximum rate of synthesis of CYP1A2 in the presence of TCDD (nmol/L/hr)
V_m       maximum rate of elimination (nmol/L/hr)
P         permeability coefficient of the hepatocytes to TCDD (L/hr)
Q         volumetric flow rate of liver blood (L/hr)
Q_a       volumetric flow rate of venous blood (L/hr)
V_a       combined arterial and venous blood volumes (L)
V_B       liver blood volume (L)
V_D       Disse space volume (L)
V_H       hepatocyte volume (L)

the termination tolerance at 10“^ when no noise was present in the data and at 
10“^ in the presence of noise. 

We used three types of simulated data in our inverse problem: data without noise, data with uniformly distributed random noise, and data with normally distributed random noise. We compared our ability to estimate the value of the parameter when we added 1% relative error and 5% relative error. All three data sets were derived from $C_a^*(t)$, the simulated solution with the "correct" value $I_{Pr} = I_{Pr}^* = 430$.
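
The structure of this experiment can be sketched with a toy forward model in place of the delay PDE solver. Everything in the block below is an assumption made for illustration: `forward_model` is a hypothetical stand-in for the simulated arterial concentration, the noise model (each value perturbed in proportion to its size) is one plausible reading of "relative error", and `nelder_mead_1d` is a simplified one-parameter variant of the Nelder-Mead search, not Matlab's routine.

```python
import math
import random


def forward_model(t, i_pr):
    """Hypothetical stand-in for the simulated concentration C_a(t; I_Pr)."""
    return 1.0 - math.exp(-i_pr * t / 430.0)


def make_cost(times, data):
    """Least squares misfit between model output and observations."""
    def cost(i_pr):
        return sum((forward_model(t, i_pr) - w) ** 2 for t, w in zip(times, data))
    return cost


def add_relative_noise(data, level, kind, rng):
    """Perturb each value by level * value * (uniform or normal sample)."""
    if kind == "u":
        return [w * (1.0 + level * rng.uniform(-1.0, 1.0)) for w in data]
    return [w * (1.0 + level * rng.gauss(0.0, 1.0)) for w in data]


def nelder_mead_1d(cost, x_a, x_b, tol=1e-8, max_iter=1000):
    """Simplified Nelder-Mead search for one parameter (two-point simplex)."""
    simplex = sorted([(cost(x_a), x_a), (cost(x_b), x_b)])
    for _ in range(max_iter):
        (f_best, x_best), (f_worst, x_worst) = simplex
        if abs(x_worst - x_best) < tol:
            break
        step = x_best - x_worst
        x_r = x_best + step                      # reflect worst through best
        f_r = cost(x_r)
        if f_r < f_best:                         # try to expand further
            x_e = x_best + 2.0 * step
            f_e = cost(x_e)
            new = (f_e, x_e) if f_e < f_r else (f_r, x_r)
        elif f_r < f_worst:                      # outside contraction
            x_c = x_best + 0.5 * step
            f_c = cost(x_c)
            new = (f_c, x_c) if f_c <= f_r else (f_r, x_r)
        else:                                    # inside contraction (shrink)
            x_c = x_best - 0.5 * step
            new = (cost(x_c), x_c)
        simplex = sorted([(f_best, x_best), new])
    return simplex[0][1]
```

A typical use mirrors the first row of Table 1: generate data at the "true" value 430, start from the simplex (420.5, 420.51), and minimize, for example `nelder_mead_1d(make_cost(times, data), 420.5, 420.51)` with `times` spread over zero to three hours. Noisy variants are obtained by passing `add_relative_noise(data, 0.01, "u", random.Random(0))` or the `"n"` option to `make_cost` instead of the clean data.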

5 Concluding Remarks 

In this study we considered a parameter identification problem in a biochemical model. We focused on constructing a numerical procedure for this problem, based on an understanding of the underlying mathematical properties of delay PDEs with parameters and of the convergence of the corresponding Galerkin discretizations. A solver for the delay PDEs involved was given, and the feasibility of the final algorithm was demonstrated. Future work will be devoted to the identification of $D_N$ and $I_{Pr}$, viewed as spatially distributed parameters, along with theoretical and numerical studies of the convergence rates of the estimates.
Abstract. A new parallel algorithm for signal processing and a parallel systolic architecture of a CFAR processor with adaptive censoring and post detection integration (API) are presented in the paper. The proposed processor is used for target detection when echoes from targets are received in conditions of binomially distributed pulse jamming. The key property of the proposed algorithm is its ability to automatically determine and censor the unwanted samples corrupted by pulse jamming in both the two-dimensional reference window and the test cell, before noise level estimation. In the case of binomially distributed pulse jamming with a high repetition frequency, the censoring capabilities of the algorithm proposed by Behar are limited and the probability of false alarm is not constant. We propose that the vector at the output of the reference window be sorted and censored again. In this way the influence of the pulse jamming environment on the adaptive threshold is reduced to a minimum. The systolic architecture of the API CFAR is designed. The computational cost of the systolic architecture is estimated as the number of processor elements and computational steps needed for real-time implementation.



1 Introduction 

Target detection is declared if the signal value exceeds the threshold. Conventional cell averaging constant false alarm rate (CA CFAR) detectors were proposed by Finn and Johnson in [3]. The presence of strong binomially distributed pulse jamming in both the test resolution cells and the reference cells can cause drastic degradation in the performance of a CA CFAR processor. In many practical situations, however, the environment is characterized by the presence of strong pulse jamming (PJ) with high intensity and binomial distribution. We assume that the noise in the test cell is Rayleigh envelope distributed and that the target returns fluctuate according to the Swerling II model.

* This work is supported by IIT - 010044, MPS Ltd. Grant “RDR”, Bulgarian NF 
“SR” Grant No I - 902/99 and Ministry of Defense Grant No 20. 
There are many methods for increasing the efficiency of CFAR processors in the case of non-stationary interference. One of these methods is the use of ordered statistics for estimating the interference level in the reference window, suggested by Rohling [9] and by Rickard and Dillard [8]. Another approach for estimating the interference level, proposed by Himonas and Barkat in [6], is used in this paper. Later, Himonas suggested in [5] that this method be used for removing randomly arriving interference impulses from the test cell when the reference window contains no impulse interference. Behar, Kabakchiev and Doukovska in [2] proposed an adaptive censoring PI CFAR detector for the presence of pulse jamming, together with a systolic architecture for this detector. When the probability of the appearance of pulse jamming is high, and the sizes of the reference and the test windows are small, the censoring algorithm in [2] is unsatisfactory.

In this paper we propose a new modification of the parallel algorithm from [2]. After additional censoring, the estimate in the reference window of the parallel algorithm is similar to that of the optimal algorithm. In this way the influence of the pulse jamming environment on the adaptive threshold is reduced to a minimum. This leads to an increase in the probability of detection.

The possibility for parallel processing of the samples in the reference window 
is used in the obtained parallel computing architecture of the target detection 
algorithm. Radar signal processing is performed by images (bit matrices) in 
real-time. It is known that the two-dimensional window is sliding over an image, 
passing the distance cells one by one. It is also known that systolic architec- 
tures are very appropriate for real-time implementation in signal processing. 
For the design of a systolic architecture we have used the “superposition” ap- 
proach according to which the known systolic architectures are used for each of 
the algorithm nodes. The designed parallel systolic architecture is suitable for 
conventional multiprocessor realization. 



2 Signal Model 

Let us assume that $L$ pulses hit the target, which is modeled according to the
Swerling II case. The received signal is sampled in range by using $(M+1)$
resolution cells, resulting in a matrix with $(M+1)$ rows and $L$ columns. Each
column of the data matrix consists of the values of the signal obtained for $L$
pulse intervals in one range resolution cell. Let us also assume that the first $M/2$
and the last $M/2$ rows of the data matrix are used as a reference window in
order to estimate the "noise-plus-interference" level in the test resolution cell
of the radar. In this case the samples of the reference cells form a matrix
$X$ of size $M \times L$. The test cell, or the radar target image, consists of the
elements of row $(M/2+1)$ of the data matrix and is a vector $Z$ of length
$L$. Under a binomial distribution of pulse jamming [1], the background
environment includes the interference-plus-noise situation, which may appear
at the output of the receiver with probability $2\varepsilon(1-\varepsilon)$, the
interference-plus-noise situation with probability $\varepsilon^2$, and the noise-only
situation with probability $(1-\varepsilon)^2$, where $\varepsilon = 1 - \sqrt{1 - t_c F}$,
$F$ is the average repetition frequency of PJ and $t_c$ is the length of pulse
transmission. The distribution is binomial when the probability of PJ is above 0.2,
as is given in [4]. In the presence of a desired signal from a target the elements
of the test resolution cell are independent random variables with the following
distribution law:

$$f(z_j) = \frac{(1-\varepsilon)^2}{\lambda_0(1+s)} \exp\left(\frac{-z_j}{\lambda_0(1+s)}\right) + \frac{2\varepsilon(1-\varepsilon)}{\lambda_0(1+r_j+s)} \exp\left(\frac{-z_j}{\lambda_0(1+r_j+s)}\right) + \frac{\varepsilon^2}{\lambda_0(1+2r_j+s)} \exp\left(\frac{-z_j}{\lambda_0(1+2r_j+s)}\right) \qquad (1)$$

where $s$ is the average per pulse value of the "signal-to-noise" ratio (SNR) at
the receiver input, $N = ML$, $\lambda_0$ is the average power of the receiver noise, and
$r_j$ is the average per pulse value of the interference-to-noise ratio (INR) at the
receiver input. The elements of the reference window are independent random
variables with the compound exponential distribution law (1), setting $s = 0$.
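
The distribution law (1) is a three-component mixture of exponential densities, which can be evaluated directly. The following is a minimal sketch only; the function name `pj_mixture_pdf` and its argument names are illustrative assumptions and do not come from the paper.

```python
import math

def pj_mixture_pdf(z, s, r, eps, lam0=1.0):
    """Evaluate the compound exponential density (1) at a sample value z.

    s    : per-pulse signal-to-noise ratio (use s = 0 for reference cells)
    r    : per-pulse interference-to-noise ratio
    eps  : the probability parameter of the binomial pulse-jamming model
    lam0 : average power of the receiver noise
    """
    # Mixture weights: noise only, one interference term, two interference terms.
    weights = [(1 - eps) ** 2, 2 * eps * (1 - eps), eps ** 2]
    # Mean power of the exponential component in each situation.
    means = [lam0 * (1 + s), lam0 * (1 + r + s), lam0 * (1 + 2 * r + s)]
    return sum(w / m * math.exp(-z / m) for w, m in zip(weights, means))
```

Because the three weights sum to one, the density integrates to one for any admissible parameter values, which gives a quick sanity check of an implementation.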



3 Analysis of API CFAR Processor 



The censoring algorithm consists of the following stages:

The elements of the reference window $x = (x_1, x_2, \ldots, x_N)$ and the test resolution
cell $z = (z_1, z_2, \ldots, z_L)$ are rank-ordered according to increasing magnitude. Each
of the so ranked elements is compared with the adaptive threshold, according to
the following rule:

$$x^{(1)}_{i+1} \ge s_i^x T_i^x, \qquad z^{(1)}_{i+1} \ge s_i^z T_i^z \qquad (2)$$

where $s_i^x = \sum_{l=1}^{i} x^{(1)}_l$ and $s_i^z = \sum_{l=1}^{i} z^{(1)}_l$. The scale factors $T_i^x$ and $T_i^z$ are determined
in accordance with the given level of probability of false censoring $P_{fc}$ as in
[6]. The recursive procedure is stopped when the condition (2) becomes true.
In this way the samples of the reference window and the test resolution cell are
divided into two parts. The first part contains the "clean" elements, i.e. without
pulse jamming. All these elements can be used for calculating the estimate $V$
and the summed signal $q_0$ as in [2]. After the recursive procedure stops,
it is assumed that most or all of the random impulses of pulse jamming are in
the second part of the reference window and the test resolution cell. In this case
the probability of target detection in the presence of binomially distributed pulse
jamming may be calculated as in [2], using the following expression:



N L 

Pd{s) = '^Pk^Pi 
k=l 1=1 




k + i — \ 
3 



(T„ -k s -k 1)'=+* 



( 3 ) 



where $p_k$ and $p_l$ are the probabilities that $k$ elements from the reference window
and $l$ elements from the test window contain only the receiver noise before
censoring. The probability for absence of pulse jamming in an element is $(1-\varepsilon)^2$.
The expressions for the probabilities $p_k$ and $p_l$ are:

$$p_k = \binom{N}{k} (1-\varepsilon)^{2k} \left(1 - (1-\varepsilon)^2\right)^{N-k}; \qquad p_l = \binom{L}{l} (1-\varepsilon)^{2l} \left(1 - (1-\varepsilon)^2\right)^{L-l} \qquad (4)$$

The probability of false alarm is evaluated by (3), setting s=0. 
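
The weights in expression (4) are ordinary binomial probabilities with success probability equal to the probability that an element is free of pulse jamming. A short sketch, assuming that reading of (4); the function name `clean_count_pmf` is an illustrative choice, not notation from the paper.

```python
from math import comb

def clean_count_pmf(count, window_size, eps):
    """Probability that exactly `count` of `window_size` elements contain only receiver noise.

    An element is free of pulse jamming with probability (1 - eps)**2.
    """
    q = (1.0 - eps) ** 2
    return comb(window_size, count) * q ** count * (1.0 - q) ** (window_size - count)
```

Summed over all counts from 0 to the window size the probabilities add up to one; the sums in (3) start at 1, so the case with no clean element is excluded there.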

When a parallel algorithm is used with small sizes of the reference and the test
windows, the estimates of the environment obtained in the API CFAR processor
are not correct.

We propose to form a vector that includes the clean elements remaining after
censoring in the reference window. This vector must be sorted and censored again,
because in some of the channels of the parallel algorithm the censoring is poor,
especially when the repetition frequency of pulse jamming is high. In this case
the estimate in the reference window in the new parallel algorithm is similar to
that of the optimal algorithm presented in [2].
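
The censoring stage and the additional censoring of the pooled clean elements can be sketched as follows. This is only an illustration under stated assumptions: each ranked sample is compared with the running sum of the smaller samples multiplied by a scale factor, the scale factors are supplied by the caller (in the paper they follow from the probability of false censoring as in [6]), and each channel is taken to be one list of samples. The names `censor` and `censor_reference_window` are hypothetical.

```python
def censor(samples, scale_factors):
    """Return the 'clean' part of the samples, in increasing order.

    The ranked samples are scanned from the smallest; scanning stops at the
    first sample that reaches scale_factor * (sum of all smaller samples).
    scale_factors needs at least len(samples) - 1 entries.
    """
    ranked = sorted(samples)
    partial_sum = 0.0
    for i in range(1, len(ranked)):
        partial_sum += ranked[i - 1]          # sum of the i smallest samples
        if ranked[i] >= partial_sum * scale_factors[i - 1]:
            return ranked[:i]                 # everything above is treated as jammed
    return ranked                             # no sample exceeded the threshold

def censor_reference_window(channels, channel_factors, pooled_factors):
    """Censor every channel separately, then sort and censor the pooled clean elements again."""
    pooled = []
    for channel in channels:
        pooled.extend(censor(channel, channel_factors))
    return censor(pooled, pooled_factors)
```

In the second test case below, a channel consisting only of jammed samples survives its own censoring untouched, and is removed only by the additional censoring of the pooled vector.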

4 Numerical Results 

We study in this paper the censoring capabilities of an API CFAR detector in 
strong binomial pulse jamming. The average power of the receiver noise is Aq = 1. 
The results for the probability of censoring are received by using Monte-Carlo 
simulation. 

The probability of censoring of the offered new algorithm, for probability of 
false censoring from 10“® to 10“^, is presented on Fig.l. The probability for 
the appearance of pulse jamming is 0.6 and the size of the reference window is 
M=16 and L=16. The unwanted samples are censored more efficiently as the 
probability of false censoring increases, because the adaptive threshold in each 
step of the censoring process decreases. 



Fig. 1. Probability of censoring in the reference window for a range of values of Pfc

Fig. 2. Probability of censoring of the new algorithm and of the algorithm from [2]



The censoring capabilities of the suggested parallel algorithm in [2] and the 
new parallel algorithm, only for the reference window, are presented on Fig. 2. 
The experimental results are obtained for the following parameters: size of the
reference window M=16, L=16, 32, 64, probability of false censoring Pfc = 10“^, 
probability for the appearance of pulse jamming is 0.8. When the size of the 
reference window increases, the probability of censoring also increases. When we 
use additional censoring, the estimate in the reference window is similar to the 
estimate of the optimal algorithm considered in [2] . 




Fig. 3. Probability of censoring in the test window for SNR=20dB

Fig. 4. Probability of censoring in the test window for SNR=10, 20, 30dB



The censoring capabilities in the test window with size L = 16, 32, 64, 128, 256 
are presented on Fig. 3. The probability for the appearance of pulse jamming in 
the elements is 0.8. As expected, the probability of censoring increases when the 
size of the reference window increases. Censoring improves when the difference 
between the signal power and the interference power increases (over 3-5dB). 

The probability of censoring in the reference window with a size L = 64 
is presented on Fig. 4. The probability for the appearance of pulse jamming is 
0.3 and 0.8, for different values of the interference to noise ratio. Also, as the 
probability for the appearance of pulse jamming in the elements increases the 
probability of censoring in these samples for a given INR decreases (over 3-5dB). 

5 Parallel Architecture of an API CFAR Processor 

The systolic parallel architecture of a CA CFAR processor with adaptive censoring
and non-coherent integration is presented in Fig. 5; it is similar to the
one presented in [2]. The computational blocks for sorting and censoring of the
vectors are denoted as $S_l$ and $C_l$, where $l = 1, \ldots, L+2$.

The parallel systolic architecture suggested in Fig. 5 is also suitable for
multiprocessor realization. The systolic architectures of the sorting and censoring
computational blocks are shown in Fig. 6 and Fig. 7.

The systolic architecture of the sorting block is realized in accordance with
the well-known "Odd-Even Transposition Sort" method. This method is used in
the systolic architecture of adaptive MTD processors described in [7]. The
analysis of the architecture shows that four types of processor elements are needed
for the realization of an API CFAR processor: PE1 to PE4, as in [2].

Fig. 5. Systolic architecture of a CA CFAR processor with adaptive censoring and
non-coherent integration

Fig. 6. Systolic architecture of the sorting algorithm
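
For reference, the odd-even transposition sort named above can be written sequentially as below. This is a generic textbook sketch, not the processor-element design of the paper; in a systolic array all compare-exchange operations of one phase are performed simultaneously by neighbouring elements.

```python
def odd_even_transposition_sort(values):
    """Sort a sequence in increasing order with n phases of neighbour compare-exchange."""
    data = list(values)
    n = len(data)
    for phase in range(n):
        # Even phases compare pairs (0,1), (2,3), ...; odd phases compare (1,2), (3,4), ...
        start = phase % 2
        for i in range(start, n - 1, 2):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data
```

With n elements, n phases are sufficient, which is why the number of steps of a sorting stage grows linearly with the length of the sorted vector when the phases are executed in parallel.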



6 Estimation of the Systolic Architecture Parameters 

We use the following basic measures to evaluate the systolic architecture param- 
eters: the number of processing elements, the number of computational steps 
and the speed-up of the computational process. The computational measures of 
the processor, calculated for each stage of signal processing, are as follows: 

1. Sorting of vectors: 

Number of elements PEI = [L{M — 1)^ + {L — l)^]/2; PE2 = L{M — 2) + (L — 
2) -h i:>i; where Di = (M - i), if M > L or Di = L{L - M), if M < L. 
Number of steps: Ti = M, if M > L or Ti = L, if M < L. 

2. Censoring of vectors: 

Number of elements PE3 = L{M + 1); PE2 = L[M{M — 1) -|- (T — 1)] -I- Z?i; 
Number of steps T 2 = M, if M > L or T 2 = L, if M < L. 
Fig. 7. Systolic structure of the censoring algorithm



3. Sorting of sums: 

Number of elements PEI = {L — 1)^/2; PE2 = 2{L — 1); 

Number of steps T^ = L. 

4. Censoring of sums: 

Number of elements PE3 = L; PE2 = Number of steps T 4 = L. 

5. Comparing: 

Number of elements PEA = 1; Number of steps T 5 = 1. 

Consequently, the computational measures of the systolic architecture are: 
Total number of processor elements. 

NpE = NpEl + NpE2 + NpE3 + NpEi = — ^ ^ ^ + 2(£)i — 1) (5) 
Total number of computational steps:

$$T_0 = T_1 + T_2 + T_3 + T_4 + T_5 = \begin{cases} 2(M+L)+1, & \text{if } M > L \\ 4L+1, & \text{if } M < L \end{cases} \qquad (6)$$



When the algorithm of the API CFAR detector is realized with a conventional 
processor, then the type and the numbers of operations are as follows: 

1. Comparing: ^ L{L-if + 1) - 1 

2 . Assignment: N 2 = + SLiL-if 5 

3. Summarizing: A 3 = L{M + 1) — 2 

4. Multiplication: = L{M + 1) — 1 

If we assume that these operations are carried out for equal periods of time
T = 1, then the total number of computational steps is:

$$N_1 + N_2 + N_3 + N_4 = 2ML(ML-1)^2 + 2L(L-1)^2 + 3L(M+1) + 1 \qquad (7)$$

Consequently, when a parallel systolic structure is used in an API CFAR
processor, the signal processing is speeded up by:

$$K_{up} = \frac{N_1 + N_2 + N_3 + N_4}{T_0} \gg 1 \qquad (8)$$

where $K_{up}$ is the speed-up of the computational process. For example, when
$M = L = 16$, $K_{up} = 512320$.
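
The quoted speed-up can be checked numerically. The sketch below assumes the step counts as read from (6) and (7): the systolic array needs max(M, L) steps for each of the first two stages, L steps for each of the next two and one comparison step, while the conventional processor needs 2ML(ML-1)^2 + 2L(L-1)^2 + 3L(M+1) + 1 steps. The function names are illustrative.

```python
def systolic_steps(m, l):
    """Total number of steps of the systolic architecture, T0 in (6)."""
    # Sorting and censoring of vectors take max(M, L) steps each,
    # sorting and censoring of sums take L steps each, comparing takes 1.
    return 2 * max(m, l) + 2 * l + 1

def sequential_steps(m, l):
    """Total number of operations on a conventional processor, as in (7)."""
    return 2 * m * l * (m * l - 1) ** 2 + 2 * l * (l - 1) ** 2 + 3 * l * (m + 1) + 1

def speed_up(m, l):
    """Speed-up of the systolic structure over the conventional processor, as in (8)."""
    return sequential_steps(m, l) / systolic_steps(m, l)
```

For M = L = 16 this gives 33300817 / 65, whose integer part is the value 512320 quoted in the text.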



7 Conclusions 

A new parallel algorithm of an API CFAR processor, for target detection in the 
presence of strong pulse jamming, is studied in this paper. The analysis of the 
censoring used in the new parallel algorithm shows that this algorithm is very 
effective when the intensity of pulse jamming (power and repetition frequency) 
is very high. The new parallel algorithm works successfully in the presence of 
binomial distribution pulse jamming. 

We have offered to form a vector, which includes clean elements after cen- 
soring in the reference window. This vector must be sorted and censored again, 
because in some of its channels the censoring is not good, especially when the 
repetition frequency of pulse jamming is high. 

The proposed parallel algorithm is realized on a parallel systolic architecture.
It can be used for VLSI realization and in multiprocessor architectures. The use
of the new algorithm leads to additional parallelization and speed-up of signal
processing. The computational cost of the suggested systolic architecture is
evaluated by the number of processor elements and the number of computational
steps needed for real-time implementation.
Abstract. In this paper we consider elliptic problems with variable
discontinuous coefficients and interface jump conditions, in which the
solution is continuous, but the jump of the flux depends on the solution.
A new numerical method, based on the immersed-boundary approach combined
with the level set method, is developed. Since it uses regular grids, it is
robust and easy to implement for curvilinear interface problems. Numerical
experiments are presented.

Keywords: elliptic problems, immersed interface method, discontinuous 
coefficients, Cartesian grid. 

1 Introduction 

Many numerical methods designed for smooth solutions of ODEs or PDEs do not
work for interface problems. Interface problems have solutions which (or
whose derivatives) may be discontinuous across one or several interfaces within
the solution domain. The usual jump (interface) conditions are

$$[u]_\Gamma = \eta(x), \quad [\beta u_n]_\Gamma = w(x), \quad x \in \Gamma, \qquad (1)$$

where $\Gamma$ is an arbitrary smooth curve in the domain and $\beta$ may have a
discontinuity on $\Gamma$.

Recently many people have worked in this area: the "immersed boundary"
method proposed by C. Peskin [10], the "immersed interface" method (IIM)
developed by Z. Li [6], appropriate FEMs [2].

The "boundary condition capturing" method (BCCM) [7] uses the "ghost
fluid" method (GFM) to capture the boundary conditions. It is robust and
simple to implement even in three spatial dimensions for the general case of
jump conditions. The BCCM is a first order accurate method.

Convergence analysis of the IIM was done in [3] and of the GFM in [8].

In this paper we consider another kind of jump conditions:

$$[u]_\Gamma = 0, \quad [\beta u_n]_\Gamma = K(x)u(x), \quad x \in \Gamma. \qquad (2)$$

Problems of this type are often encountered in material sciences, biology and
fluid dynamics [1, 12]. The IIM for such interface problems was studied in [4, 5].

I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 456-464, 2004.
2 Finite Difference Scheme for 1D Model Problem

Consider the model equation

$$(\beta u_x)_x - ku = f(x), \quad x \in (0,\xi) \cup (\xi,1),$$
$$[u]_\xi = 0, \quad [\beta u_x]_\xi = K u(\xi), \qquad (3)$$
$$u(0) = u_0, \quad u(1) = u_1,$$

where $\beta(x)$, $k(x)$, and $f(x)$ may have a discontinuity at the interface $x = \xi$, and
$K > 0$ is a constant.

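Further we use the notation from [11]. Let us introduce a uniform grid
$x_i = ih$, $i = 0, 1, \ldots, N$, with $h = 1/N$, and a level set function (LSF) $\phi(x)$ such
that $\phi(\xi) = 0$, $\phi(x) < 0$ for $x \in \Omega^-$ and $\phi(x) > 0$ for $x \in \Omega^+$ [9]. The unit
normal points from $\Omega^-$ to $\Omega^+$, and using the LSF it is computed at the grid nodes
$x_i$, $i = 1, \ldots, N-1$:

$$n_i^1 = \frac{\phi_{i+1} - \phi_{i-1}}{2h} \Big/ \left|\frac{\phi_{i+1} - \phi_{i-1}}{2h}\right| = \frac{\phi_{i+1} - \phi_{i-1}}{|\phi_{i+1} - \phi_{i-1}|}.$$

Since $n = n^1 = \pm 1$, then $[\beta u_n]_\Gamma = [\beta u_x]_\Gamma n^1 = K u_\Gamma$ and hence
$[\beta u_x]_\Gamma = K u(\xi) n^1$.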

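The finite difference approximation to equation (3) can be written as

$$L_h U_i = \frac{1}{h}\left(\beta_{i+1/2}\frac{U_{i+1} - U_i}{h} - \beta_{i-1/2}\frac{U_i - U_{i-1}}{h}\right) - k_i U_i - F_i^L - F_i^R = f_i.$$

Consider the left arm of the three point stencil, i.e. the line segment
connecting $x_{i-1}$ and $x_i$. If $\phi_i < 0$, $\phi_{i-1} < 0$ or if both $\phi_i > 0$, $\phi_{i-1} > 0$, then
$\beta_{i-1/2}$ is standard ($= \beta(x_{i-1/2})$) and $F_i^L = 0$. Otherwise, define

$$p = \frac{|\phi_{i-1}|}{|\phi_{i-1}| + |\phi_i|} \quad \text{and} \quad U_\Gamma = p\, n_i^1 U_i + (1-p)\, n_{i-1}^1 U_{i-1}.$$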

li (j>i >0, (j>i-i < 0 then the additional term F^" is 



= lE-KUr, 



A-1/2 — P — 



(3-(3+ 



pP+ + {l-p)p-' 
li 4>i < 0, 4>i-i > 0 then the additional term is 






Pi-1/2 — P — 



P-P+ 



pP + (1 ~ p)P~^ 



Next, let us consider the right arm of the stencil. If pi < 0, pi+i < 0 or if 
both pi > 0, pin > 0, then Pin /2 = P{p^i+i/ 2 ) and F^ = 0. Otherwise, define 



P = 






\Pi+l I + \Pi 



and Ur = pn\U, -f (1 - p)n\nUin- 
If $\phi_i > 0$, $\phi_{i+1} < 0$ then the additional term $F_i^R$ is



F« = 4^KUr, A+1/2 = ^ = ^ 



^ P/3+ + (1 - p)/3- ■ 

li 4>i <0, > 0 then the additional term F/^ is 

* /3-h^ ^ p/3- + (1 - p)/3+ ■ 

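With this algorithm the difference scheme for the problem (3) reads as follows:
at the regular grid points $x_i$, $i \ne I, I+1$ ($x_I < \xi < x_{I+1}$):

$$\beta_{i+1/2}\frac{U_{i+1} - U_i}{h^2} - \beta_{i-1/2}\frac{U_i - U_{i-1}}{h^2} - k_i U_i = f_i, \qquad (4)$$

at the irregular grid points $x_i$, $i = I, I+1$:

$$\hat\beta\,\frac{U_{I+1} - U_I}{h^2} - \beta_{I-1/2}\frac{U_I - U_{I-1}}{h^2} - k_I U_I - \frac{\hat\beta}{\beta^+}\,\frac{1-\rho_I}{h}\, K U_\Gamma = f_I, \qquad (5)$$

$$\beta_{I+3/2}\frac{U_{I+2} - U_{I+1}}{h^2} - \hat\beta\,\frac{U_{I+1} - U_I}{h^2} - k_{I+1} U_{I+1} - \frac{\hat\beta}{\beta^-}\,\frac{\rho_I}{h}\, K U_\Gamma = f_{I+1}, \qquad (6)$$

where $\rho_I = (\xi - x_I)/h$, $\hat\beta = \beta^+\beta^-/(\rho_I\beta^+ + (1-\rho_I)\beta^-)$, and for $U_\Gamma$ we use the
interpolation formula $U_\Gamma = \rho_I U_{I+1} + (1-\rho_I) U_I$.

We define the local truncation error (LTE) as $T_i = L_h u_i - f_i$. So $T_i = O(h^2)$,
$i \ne I, I+1$, and $T_i = O(1)$, $i = I, I+1$, and the scheme (4)-(6) has first
order accuracy for the problem (3).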
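
The interface quantities entering the scheme at the two irregular points can be computed as in the following sketch. It assumes the definitions read from the text: the relative interface position rho_I = (xi - x_I)/h, the averaged coefficient beta^+ beta^- / (rho_I beta^+ + (1 - rho_I) beta^-), and linear interpolation of the grid function to the interface. The function and argument names are illustrative.

```python
def interface_quantities(x_left, h, xi, beta_minus, beta_plus, u_left, u_right):
    """Return (rho, beta_hat, u_gamma) for an interface xi with x_left < xi < x_left + h.

    beta_minus / beta_plus are the limiting coefficient values on the left / right
    of the interface; u_left / u_right are the grid values at x_left and x_left + h.
    """
    rho = (xi - x_left) / h                     # fraction of the cell left of the interface
    beta_hat = beta_plus * beta_minus / (rho * beta_plus + (1.0 - rho) * beta_minus)
    u_gamma = rho * u_right + (1.0 - rho) * u_left   # linear interpolation to the interface
    return rho, beta_hat, u_gamma
```

When the coefficient is continuous (beta_minus equal to beta_plus), the averaged coefficient reduces to that common value, so the modified flux coincides with the standard one.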



3 Finite Difference Scheme for 2D Model Problem 

We consider the problem on the square $\Omega = [0,1] \times [0,1]$

$$\nabla \cdot (\beta(x,y) \nabla u(x,y)) = f(x,y), \quad (x,y) \in \Omega \setminus \Gamma,$$
$$u(x,y) = g(x,y), \quad (x,y) \in \partial\Omega, \qquad (7)$$
$$[u]_\Gamma = 0, \quad [\beta u_n]_\Gamma = K(x,y)u(x,y), \quad (x,y) \in \Gamma,$$

where $\Gamma: \phi(x,y) = 0$ is the interface curve. We want to give an accurate numerical
method using Cartesian grids. Let us introduce a uniform grid with mesh sizes
$h_1$, $h_2$ and $N_1 h_1 = 1$, $N_2 h_2 = 1$. At a regular grid point (away from the interface)
we use the standard central difference method. At an irregular grid point, where the
five-point stencil crosses the interface, see Fig. 1, we modify the scheme.

Let us introduce the level set function $\phi(x,y)$, so that $\phi(x,y) = 0$ when
$(x,y) \in \Gamma$, $\phi(x,y) < 0$ in $\Omega^-$ and $\phi(x,y) > 0$ in $\Omega^+$. The unit normal
$n(n^1, n^2)$ points from $\Omega^-$ into $\Omega^+$, and the normalized tangent direction is
$t(-n^2, n^1)$. Suppose that $[\beta u_t]_\Gamma = 0$, see [7]; then the derivative jump condition
in (7) can be rewritten as two separate jump conditions:

$$[\beta u_x]_\Gamma = K u(x,y) n^1, \quad [\beta u_y]_\Gamma = K u(x,y) n^2, \quad (x,y) \in \Gamma.$$

Fig. 1. The geometry at an irregular grid point near the interface.

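The discretization at each grid point $(x_i, y_j)$ is

$$\frac{1}{h_1}\left(\beta_{i+1/2,j}\frac{U_{i+1,j} - U_{i,j}}{h_1} - \beta_{i-1/2,j}\frac{U_{i,j} - U_{i-1,j}}{h_1}\right) + \frac{1}{h_2}\left(\beta_{i,j+1/2}\frac{U_{i,j+1} - U_{i,j}}{h_2} - \beta_{i,j-1/2}\frac{U_{i,j} - U_{i,j-1}}{h_2}\right) - D^x_{i,j} - D^y_{i,j} = f_{i,j}.$$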

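Each $\beta_{i+1/2,j}$ is evaluated based on the side of the interface that $(x_i, y_j)$ and
$(x_{i+1}, y_j)$ lie on. If they lie on opposite sides of the interface, then

$$\beta_{i+1/2,j} = \hat\beta = \frac{\beta^+\beta^-\left(|\phi^-| + |\phi^+|\right)}{\beta^+|\phi^-| + \beta^-|\phi^+|},$$

where $\beta^+$ and $\beta^-$ are the limiting values of the coefficients, $\phi^- = \phi(x_i, y_j)$,
$\phi^+ = \phi(x_{i+1}, y_j)$ if $(x_i, y_j) \in \Omega^-$ and $(x_{i+1}, y_j) \in \Omega^+$; $D^x_{i,j} = D^L_{i,j} + D^R_{i,j}$,
$D^y_{i,j} = D^T_{i,j} + D^B_{i,j}$.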

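Consider the left arm of the stencil, i.e. the line segment connecting $(x_i, y_j)$
and $(x_{i-1}, y_j)$. If $\phi_{i,j} > 0$ and $\phi_{i-1,j} > 0$, or if $\phi_{i,j} < 0$ and $\phi_{i-1,j} < 0$, then
$D^L_{i,j} = 0$. In other cases we define:

$$p = \frac{|\phi_{i-1,j}|}{|\phi_{i,j}| + |\phi_{i-1,j}|}, \qquad U_\Gamma = K\left(U_{i,j}\, n^1_{i,j}\, p + U_{i-1,j}\, n^1_{i-1,j}(1-p)\right).$$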



If 4>i,j < 0 and 4>i-ij > 0, then (i); if (pij > 0 and ^ 0i then (ii)\ 



(i) D 



L 



Pjj 

~WhT"'' 



Oi) DYj 



Pi-lj2,j Pjj 



In a similar way we work in the other cases, when the interface crosses the right arm
(R), the top arm (T), or the bottom arm (B) of the stencil.

The LTE at the regular grid points is $O(h^2)$ and at the irregular grid points
it is $O(1)$. This leads to first order accuracy of the proposed method.
4 Numerical Analysis



Example 1: Constant diffusion coefficients



/3 



—X^u(x), 


X G (0,^) U (C,l) , m(0) = m(1) = 0, 


0, 


= Ku(^), 




^ < X < 1, 


l(r)^ 


0 < X < ^. 



For K = —1, P = 1, /3+ = 4, ^ = 0.5, and A = 4.7371 the exact solution is 



u{x) 



sin|^/sin|l- , 
sm / sin , 



0 < a; < ^, 
^ < a; < 1. 



Example 2: Variable diffusion coefficients



iPux)x = f{x), a; e (0,^) U (^,1), 

= 0 : = Ku{^), 

a _ f exp(^ — a:) + l, 0<a;<^ 

[a:+l, ^<a;<l. 

We choose the exact solution to be 

+ X — C ^/2 + ^ + (^^ — 1)/(1 + ^ — ^ K ), 0 < X < ^, 

\-exp(^-x) + (l+^-K)/(l + ^-^K), C<x<l. 



In Table 1 the mesh refinement analysis in the maximum norm for Examples 1 and
2 is given. The local ($T_\infty = \max_{0<i<N} |T_i|$) and global ($E_\infty = \max_{0<i<N} |u(x_i) - U_i|$)
errors are reported. The results confirm that the method is first order accurate, see
the rate of convergence m.

Table 1. Mesh refinement analysis in infinity norm for Example 1, 2.

         Constant coefficients       Variable coefficients
  N      E∞             T∞           E∞             T∞
  15     6.0095e-03     2.1314       1.8697e-02     5.9143
  31     3.0667e-03     2.1538       9.1722e-03     5.9591
  63     1.5473e-03     2.1656       4.5401e-03     5.9770
  127    7.7697e-04     2.1883       2.2583e-03     5.9884
  255    3.8929e-04     2.1889       1.1262e-03     5.9942
  m      O(h)           O(1)         O(h)           O(1)
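
The observed order of convergence can be estimated from consecutive rows of Table 1. The sketch below uses the global errors of the constant-coefficient example and assumes, following the grid definition h = 1/N, that the mesh size ratio between two rows equals the ratio of the values of N. The names are illustrative.

```python
import math

# Global errors for the constant-coefficient example, copied from Table 1.
global_errors = {15: 6.0095e-03, 31: 3.0667e-03, 63: 1.5473e-03,
                 127: 7.7697e-04, 255: 3.8929e-04}

def observed_orders(errors):
    """Estimate the convergence order between consecutive grids, assuming h = 1/N."""
    sizes = sorted(errors)
    orders = []
    for coarse, fine in zip(sizes, sizes[1:]):
        error_ratio = errors[coarse] / errors[fine]
        mesh_ratio = fine / coarse              # h_coarse / h_fine
        orders.append(math.log(error_ratio) / math.log(mesh_ratio))
    return orders
```

The estimated orders approach one as the grid is refined, in agreement with the rate O(h) stated in the last row of the table.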
Example 3: Line interface and continuous coefficients.

-Au{x,y) = f{x,y) , (x,y) e f2\E , : [-1, 1] x [-1, 1] , 

' du 



[^]r “ 



f{x) = 



dn 



= -V2u, (x,y) G r ■. x = y, 



-2exp(a;-y), {x,y) G Q+ ■. x > y, 

0 (x,y) € : a; < y, 

BC: from the exact solution. 



The exact solution is: 



J exp(x - y) (x, y) e f?+, 
{x,y)Gf2-. 



Fig. 2. Solution for Example 3. (a) Exact solution, N1 = N2 = 32. (b) LTE of the
computed solution, N1 = N2 = 32. (c) LTE of the computed solution, N1 = 33,
N2 = 28. (d) Error of the computed solution, N1 = 33, N2 = 28.
In Fig. 2a the exact solution of Example 3 is given. When the interface is
aligned with the grid nodes, $N_1 = N_2 = 32$ (see Fig. 2b), the LTE is $O(h)$ and
the method converges with second order accuracy. Typically the LTE is $O(1)$
when $N_1 \ne N_2$ (see Fig. 2c), and the method is of first order, see Fig. 2d. The
results of the numerical accuracy tests are given in Table 2.

Example 4: Curvilinear interface and continuous coefficients.

$$\Delta u = 0, \quad (x,y) \in \Omega \setminus \Gamma, \quad \Omega: [-1,1] \times [-1,1],$$
$$\left[\frac{\partial u}{\partial n}\right]_\Gamma = \frac{u}{r_0}, \quad (x,y) \in \Gamma: x^2 + y^2 = r_0^2.$$

BC: from the exact solution.

The exact solution is:

$$u(x,y) = \begin{cases} 1, & \text{if } x^2 + y^2 < r_0^2: \Omega^-, \\ 1 + \ln\left(\sqrt{x^2 + y^2}/r_0\right), & \text{if } x^2 + y^2 > r_0^2: \Omega^+. \end{cases}$$

In Fig. 3a the exact solution of Example 4 is given. The interface is not
aligned with the grid nodes for arbitrary $N_1$, $N_2$, so the LTE is $O(1)$ (see Fig. 3b
for $N_1 = N_2 = 32$ and $r_0 = 1/2$) and the method is first order accurate, see
Table 2.



Fig. 3. Solution for Example 4. (a) Exact solution, N1 = N2 = 32. (b) LTE of the
computed solution, N1 = N2 = 32.



Example 5: Curvilinear interface and discontinuous coefficients

$$(\beta u_x)_x + (\beta u_y)_y = f(x,y), \quad (x,y) \in \Omega \setminus \Gamma, \quad \Omega: [-1,1] \times [-1,1],$$
$$[\beta u_n]_\Gamma = K u, \quad (x,y) \in \Gamma: x^2 + y^2 = r_0^2,$$
$$f(x,y) = \begin{cases} \alpha_1^2\, r^{\alpha_1 - 2}, & r < r_0, \\ \alpha_2^2\, r^{\alpha_2 - 2}, & r > r_0, \end{cases}$$

BC: from the exact solution.

The exact solution is:

$$u = \begin{cases} r^{\alpha_1}/\beta_1, & \text{if } r < r_0: \Omega^-, \\ r^{\alpha_2}/\beta_2 + r_0^{\alpha_1}/\beta_1 - r_0^{\alpha_2}/\beta_2, & \text{if } r > r_0: \Omega^+. \end{cases}$$

In Fig. 4a the exact solution $-u$ of Example 5 for $\beta_1 = 1$, $\beta_2 = 2$, $\alpha_1 = 4$,
$\alpha_2 = 2$, $K = 8$, $r_0 = 0.5$, $N_1 = N_2 = 32$ is given. In Fig. 4b the error in the
maximum norm is presented. As the LTE is $O(h^2)$ at regular grid points and
$O(1)$ at irregular ones, the error is greater near the interface. In Table 2 the
results of the numerical accuracy test are given.
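
The parameter values of Example 5 can be checked against the jump condition. The sketch assumes the exact solution has the radial form read from the text, r^alpha_1 / beta_1 inside the circle and r^alpha_2 / beta_2 plus a constant outside, with the normal pointing outward; the function name is illustrative.

```python
def flux_jump_ratio(alpha1, alpha2, beta1, beta2, r0):
    """Return [beta * du/dn] / u on the circle r = r0 for the radial exact solution."""
    u_gamma = r0 ** alpha1 / beta1                                  # value on the interface
    flux_inside = beta1 * (alpha1 * r0 ** (alpha1 - 1) / beta1)     # beta * du/dr from inside
    flux_outside = beta2 * (alpha2 * r0 ** (alpha2 - 1) / beta2)    # beta * du/dr from outside
    return (flux_outside - flux_inside) / u_gamma
```

For alpha_1 = 4, alpha_2 = 2, beta_1 = 1, beta_2 = 2 and r_0 = 0.5 the ratio equals 8, which is the value of K listed with the other parameters.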



Fig. 4. Solution for Example 5. (a) Exact solution -u, N1 = N2 = 32. (b) Error of the
computed solution, N1 = N2 = 32.



Table 2. Mesh refinement analysis in the infinity norm for Example 3, 4, 5.

  Example 3               Example 3               Example 4               Example 5
  N1   N2   E∞            N1   N2   E∞            N1   N2   E∞            N1   N2   E∞
  8    8    6.2507e-04    7    9    7.0050e-03    8    8    6.0810e-02    8    8    1.2450e-02
  16   16   1.5671e-04    14   18   2.6699e-03    16   16   1.9841e-02    16   16   3.4307e-03
  32   32   3.9207e-05    28   36   1.2028e-03    32   32   6.5706e-03    32   32   1.0885e-03
  64   64   9.8034e-06    56   72   5.7097e-04    64   64   2.3636e-03    64   64   4.4329e-04
  128  128  2.4510e-06    112  144  2.9623e-04    128  128  1.1473e-03    128  128  1.9217e-04



Conclusions 

Numerical procedures for elliptic problems with interface jump conditions, in
which the solution is continuous, but the jump of the flux is proportional to
the solution, are developed. The most important fact is that the algorithms are
implemented using standard finite difference discretization on a Cartesian grid.
The numerical experiments show first order accuracy of the method under the
assumption that the interface and the solution are such that $[\beta u_t]_\Gamma = 0$.

Theoretical results (accuracy and convergence) are the subject of a forthcoming
paper.

References 

1. J. Chadam and H. Yin. A diffusion equation with localized chemical reactions,
Proc. of Edinburgh Math. Soc. 37 (1993) 101-118.

2. Z. Chen and J. Zou. Finite element methods and their convergence for elliptic and
parabolic interface problems, Numerische Mathematik 79 (1998) 175-202.

3. H. Huang and Z. Li. Convergence analysis of the immersed interface method,
IMA J. Numer. Anal. 19 (1999) 583-608.

4. J. Kandilarov. Comparison of various numerical methods for solving of elliptic
interface problems, FILOMAT 15 (2001) 257-264.

5. J. Kandilarov and L. Vulkov. Analysis of immersed interface difference schemes
for reaction-diffusion problems with singular own sources, Comp. Meth. in Appl.
Math. 3(2) (2003) 1-21.

6. Z. Li. The Immersed Interface Method - a Numerical Approach for Partial
Differential Equations with Interfaces, PhD Thesis, University of Washington, 1994.

7. X. Liu, R. Fedkiw and M. Kang. A boundary condition capturing method for
Poisson's equation on irregular domains, J. Comp. Phys. 160 (2000) 151-178.

8. X.-D. Liu and T. Sideris. Convergence of the Ghost Fluid Method for elliptic
equations with interfaces, Math. Comp. (to appear).

9. S. Osher and R. Fedkiw. Level Set Methods and Dynamic Implicit Surfaces,
Springer-Verlag, 2003.

10. C. Peskin. Numerical analysis of blood flow in the heart, J. Comp. Phys. 25 (1977)
220-252.

11. A. A. Samarskii. The Theory of Difference Schemes, Nauka, Moscow, 1977 (in
Russian).

12. P. Vabishchevich. Tracking difference schemes for some problems of the
mathematical physics, Diff. Uravnenia 24(7) (1988) 1161-1166 (in Russian).



Generalized Nonstandard Numerical Methods
for Nonlinear Advection-Diffusion-Reaction Equations

Hristo V. Kojouharov¹ and Bruno D. Welfert²

¹ Department of Mathematics, University of Texas at Arlington, P.O. Box 19408,
Arlington, TX 76019-0408
hristo@uta.edu

² Department of Mathematics, Arizona State University, P.O. Box 871804,
Tempe, AZ 85287-1804



Abstract. A time-splitting method for nonlinear advection-diffusion-reaction
equations is formulated and analyzed. The nonlinear advection-reaction part
of the problem is solved using a new generalized nonstandard method based on
a Lagrangian formulation and a linearizing map. The diffusion part is handled
with standard finite difference schemes. This approach leads to significant
qualitative improvements in the behavior of the numerical solutions.

1 Introduction 

Mathematical models that involve a combination of advective, diffusive, and
reactive processes appear in many fields of engineering and science. Standard
numerical methods developed for problems with smooth solutions often behave very
poorly when applied to advection-dominated problems. Eulerian-Lagrangian
approaches [3, 1, 4], among others, have greatly improved the treatment of
advection-dominated transport problems, but still little has been done for problems
with nonlinear reactions. Reactions with unstable equilibria and thresholds can
cause small numerical errors to oscillate with increasing amplitude, leading to
eventual machine blowup [7].

In this paper, we consider the following advection-diffusion-reaction equation 



for the unknown function $c = c(x,t)$ in a spatial domain $\Omega = [0,1]$, over a
time interval $J = (0,T]$, together with the initial condition $c(x,0) = g(x)$ and
periodic boundary conditions. For convenience, we also assume that all given
functions are spatially $\Omega$-periodic and all quantities are dimensionless and scaled.
In applications to groundwater hydrology, $c \in [0,1]$ denotes the concentration of
a dissolved substance, $v(x,t)$ is the Darcy velocity, $D(x,t)$ is the hydrodynamic
dispersion coefficient, and $r(c)$ accounts for chemical reactions.

We present a time-splitting algorithm based on a generalized nonstandard 
method that efficiently handles the numerically challenging transport equation 

I. Lirkov et al. (Eds.): LSSC 2003, LNCS 2907, pp. 465-472, 2004. 
(1). In the first step the advection-reaction equation ($D = 0$) is solved using
a generalized nonstandard method [5]. It allows us to follow the transport and
track sharp fronts much more accurately than with standard numerical schemes.
In the second step of the time-splitting procedure, the diffusion part is computed
using standard finite differences.

The outline of the paper is as follows. In Section 2, we present the general- 
ized nonstandard method for advection-reaction equations. An error analysis of 
the method is also included. We then describe the time-splitting procedure in 
Section 3. Numerical results and comparison with analytic solutions and stan- 
dard finite difference schemes are presented in Section 4. In the last section, 
conclusions and future research directions are outlined. 



2 Nonlinear Advection-Reaction Equations



We first consider the problem (1) with no diffusion ($D = 0$), subject to the
initial condition $c(x,0) = g(x)$ and periodic boundary conditions. The nonlinear
reaction term is of the form $r(c) = f(c)/f'(c)$, where $f$ is a given function,
assumed w.l.o.g. to be nonnegative. The motivation behind such a choice for
$r(c)$ is the fact that the resulting nonlinear differential equation in $c$

$$\frac{\partial c}{\partial t} + v\,\frac{\partial c}{\partial x} = \frac{f(c)}{f'(c)} \qquad (2)$$

reduces to the linear differential equation

$$\frac{\partial f(c)}{\partial t} + v\,\frac{\partial f(c)}{\partial x} = f(c), \qquad (3)$$

in $f(c)$. In the case of polynomial reactions $r(c) = \prod_{k=1}^{m} (c - \alpha_k)$, with distinct
$\alpha_k$'s, solving for $f$ yields

$$f(c) = \prod_{k=1}^{m} |c - \alpha_k|^{\gamma_k}, \qquad \gamma_k = \prod_{j=1,\, j \ne k}^{m} \frac{1}{\alpha_k - \alpha_j}. \qquad (4)$$
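
The defining property of the linearizing map, r(c) = f(c)/f'(c), can be verified numerically for a polynomial reaction. The sketch assumes the exponents are the partial-fraction coefficients of 1/r, that is, the reciprocal of the product of the differences between one root and all the others; the function names are illustrative, and the derivative is approximated by a central difference.

```python
import math

def exponents(alphas):
    """Partial-fraction coefficients of 1/r(c) for r(c) = prod (c - alpha_k)."""
    return [1.0 / math.prod(a - b for j, b in enumerate(alphas) if j != k)
            for k, a in enumerate(alphas)]

def linearizing_map(c, alphas):
    """f(c) = prod |c - alpha_k| ** gamma_k."""
    return math.prod(abs(c - a) ** g for a, g in zip(alphas, exponents(alphas)))

def reaction(c, alphas):
    """Polynomial reaction term r(c) = prod (c - alpha_k)."""
    return math.prod(c - a for a in alphas)

def ratio_f_over_fprime(c, alphas, step=1e-6):
    """Approximate f(c) / f'(c) with a central difference for f'."""
    derivative = (linearizing_map(c + step, alphas) - linearizing_map(c - step, alphas)) / (2 * step)
    return linearizing_map(c, alphas) / derivative
```

For the roots 0 and 1, for example, the exponents are -1 and 1, so f(c) = |c - 1| / |c| and f/f' = c(c - 1), as required.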



Using the method of characteristics, the solution of Equation (3) can be written
as

$$f(c(x,t)) = f(g(s))\,e^{t}, \qquad (5)$$

with $s = s(x)$ and where $s = \xi(0)$ is the solution at time $t = 0$ of the initial-value
problem

$$\frac{d\xi}{dt} = v(\xi, t), \quad \xi(t) = x. \qquad (6)$$

Given a temporal grid $\{t^n\}_{n \ge 0}$ and a spatial grid $\{x_i\}_{i=0}^{M}$, a comparison of
the analytic solution at times $t^n$ and $t^{n+1}$ yields the generalized nonstandard
method [5]:

$$\frac{F_i^{n+1} - F^n(\bar{x}_i^n)}{e^{\Delta t^n} - 1} = F^n(\bar{x}_i^n). \qquad (7)$$

Here, the quantity $F_i^{n+1} = f(C_i^{n+1})$ denotes the numerical approximation of
$f(c(x_i, t^{n+1}))$, with

$$F_i^0 = f(C_i^0) = f(g(x_i)) \qquad (8)$$

and $F^n(\bar{x}_i^n)$ is the numerical solution at $\bar{x}_i^n$. The backtrack point is given by
$\bar{x}_i^n = \xi(t^n)$, where $\xi$ is the solution of the initial-value problem (6) subject to
the condition $\xi(t^{n+1}) = x_i$. The numerical solution of (2) at time $t^{n+1}$ is then
recovered from

$$C_i^{n+1} = f^{-1}(F_i^{n+1}), \qquad (9)$$

where $f^{-1}$ is the inverse function of $f$. Note that because of its special form (4)
$f$ is monotonic between two zeros/poles of $r$ so that $f^{-1}$ is well-defined. The
determination of $C_i^{n+1}$ from $F_i^{n+1}$ via (9) requires either an explicit expression
for $f^{-1}$ or a numerical procedure (e.g., Newton-Raphson method) for solving
$f(C_i^{n+1}) = F_i^{n+1}$.



Error Analysis. Assuming a constant velocity field $v(x,t) = v$, the solution $\bar{x}^n$
of the initial-value problem (6) is given by $\bar{x}^n = x - v\Delta t^n$ with $\Delta t^n = t^{n+1} - t^n$.
Then the general solution $F^{n+1}(x)$ of the semi-discrete version of scheme (7)
satisfies:

$$F^{n+1}(x) = e^{\Delta t^n} F^n(x - v\Delta t^n) = \ldots = e^{t^{n+1}} F^0(x - v t^{n+1}) = e^{t^{n+1}} f(g(x - v t^{n+1})) = f(c(x, t^{n+1})).$$

Using an exact formula for $f^{-1}$ we can recover the numerical solution:

$$C^{n+1}(x) = f^{-1}(F^{n+1}(x)) = f^{-1}(f(c(x, t^{n+1}))) = c(x, t^{n+1}).$$

For arbitrary velocity fields $v(x,t)$ the initial-value problem (6) cannot be
integrated exactly and an appropriate approximation of the backtrack point, e.g.,
$x_i^n \approx x_i - v(x_i, \cdot)\,\Delta t^n$ (as in the modified method of characteristics [3]), must
be used. Also, to evaluate $F^n(x_i^n)$, some type of interpolation of the approximate
solution values $\{F_i^n\}_{i=0}^{M}$ must be used. The following theorem provides a bound
on the error between the exact solution $c$ and the numerical approximation $C$
when piecewise-linear interpolation is used [5].

Theorem 1. Assume f and the solution of (2) belong to C^([0, 1]) x C([0,T]) 
and that v is bounded. Let the approximate solution be defined by (9) with 
F”(a:") = (/i{F”})(x”) the piecewise linear interpolant o/{F"}. Assume also 
that the nonlinear equation (9) is solved exactly. Then the global error of the 
generalized nonstandard method satisfies 

lie"- - C-ll^ < FTe^ min ||u||ooAx) , 

where Ax = max Ixi — T = At^ + ■ ■ ■ + , At = and 

l<i<M n>0 



L = max (|(/-')'(/(c"-))| , | (/" ')'(/(C"-)) |) max ||(/(c'=))^^ 

3 Nonlinear Advection-Diffusion-Reaction Equations



We now present a time-splitting method, analyzed in [2], for solving the advec- 
tion-diffusion-reaction equation (1). The basic idea of a time-splitting approach is 
to treat processes like advection, diffusion, and reaction on their own in numerical 
time-stepping, so as to enable an easy use of well prepared, tailored solvers for 
these different processes. 

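The solution $c(x, t^{n+1})$ at time $t^{n+1}$ is determined from $c(x, t^n)$ as follows.
First the function $c(x, t^n)$ is used as an initial condition $c^{ar}(x, t^n) = c(x, t^n)$ for
the solution of the advection-reaction equation

$$\frac{\partial c^{ar}}{\partial t} + v\,\frac{\partial c^{ar}}{\partial x} = r(c^{ar}). \qquad (10)$$

The solution $c^{ar}(x, t^{n+1})$ generated from this step is then used as an initial data
$c^{d}(x, t^n) = c^{ar}(x, t^{n+1})$ for the diffusion equation

$$\frac{\partial c^{d}}{\partial t} - \frac{\partial}{\partial x}\left(D\,\frac{\partial c^{d}}{\partial x}\right) = 0. \qquad (11)$$

Finally the new solution at time $t^{n+1}$ is defined by $c(x, t^{n+1}) = c^{d}(x, t^{n+1})$. For
problems with small diffusion, i.e., for advection-dominated transport problems,
this splitting approach leads to a more accurate representation of the physics of
the problem [2].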

The numerical implementation of the time-splitting method is as follows.
Assume $C_1^n, \ldots, C_M^n$ are known. Set $F_i^{n+1}$, $i = 1, \ldots, M$, such that

$$\frac{F_i^{n+1} - f(C^n(x_i^n))}{e^{\Delta t^n} - 1} = f(C^n(x_i^n)), \qquad (12)$$

where $C^n(x_i^n)$ is the numerical solution at the backtrack point $x_i^n$. Then, the
approximation of the advective-reactive solution $c^{ar}$ of (10) is given by

$$C_i^{n+\frac{1}{2}} = f^{-1}(F_i^{n+1}). \qquad (13)$$



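In the second time-splitting step we solve the following implicit finite-difference
scheme:

$$\frac{C_i^{n+1} - C_i^{n+\frac{1}{2}}}{\Delta t^n} - \theta\,\frac{\partial}{\partial x}\left(D^{n+1}\frac{\partial C}{\partial x}\right)_i^{n+1} - (1-\theta)\,\frac{\partial}{\partial x}\left(D^{n}\frac{\partial C}{\partial x}\right)_i^{n+\frac{1}{2}} = 0, \qquad (14)$$

where $\theta \in [0,1]$ is a given parameter and

$$\frac{\partial}{\partial x}\left(D^{k}\frac{\partial C}{\partial x}\right)_i^{m} = \frac{2}{\Delta x_{i-1} + \Delta x_i}\left(D^k_{i+\frac{1}{2}}\,\frac{C^m_{i+1} - C^m_i}{\Delta x_i} - D^k_{i-\frac{1}{2}}\,\frac{C^m_i - C^m_{i-1}}{\Delta x_{i-1}}\right)$$

is the centered second difference approximation of the diffusion term with
$D^k_{i\pm\frac{1}{2}} = D\left(\frac{x_i + x_{i\pm 1}}{2},\, t^k\right)$.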
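
As an illustration of the diffusion sub-step, the sketch below applies a $\theta$-weighted scheme of the form (14) under simplifying assumptions that are ours and not the paper's: a constant diffusion coefficient, a uniform periodic grid, and a dense linear solve. The function name `diffusion_step` and all parameter names are hypothetical.

```python
import numpy as np

def diffusion_step(c_half, D, dx, dt, theta):
    """One theta-weighted diffusion step on a uniform periodic grid.

    Solves (c_new - c_half)/dt = D*(theta*L c_new + (1-theta)*L c_half),
    where L is the centered second-difference operator.
    Assumptions (not from the paper): constant D, uniform dx, dense solve.
    """
    M = c_half.size
    I = np.eye(M)
    # Periodic second difference: (c[i+1] - 2 c[i] + c[i-1]) / dx^2.
    L = (np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1) - 2.0 * I) / dx**2
    lhs = I - theta * dt * D * L
    rhs = c_half + (1.0 - theta) * dt * D * (L @ c_half)
    return np.linalg.solve(lhs, rhs)

# Illustrative data: M = 100 cells on [0, 1), fully implicit step (theta = 1).
M = 100
dx = 1.0 / M
x = dx * np.arange(M)
c_half = 0.5 - 0.5 * np.cos(2.0 * np.pi * x)
c_new = diffusion_step(c_half, D=0.004, dx=dx, dt=0.25, theta=1.0)
```

With a single Fourier mode as input, the step only damps the amplitude of the cosine and leaves the mean unchanged, which gives a simple consistency check for an implementation.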
Taylor series expansions show that the formal order of consistency in the
above (simple) operator splitting is equal to one. In an early paper by Strang [8]
it was pointed out that the formal first order of consistency can be raised to a 
second order if the computations are reversed at each time step. Another funda- 
mental question for any splitting application is how to relate numerical errors at 
sub-steps with the splitting error. It is clear that a certain balance should exist, 
since otherwise part of the computation would readily be too accurate or inac- 
curate, resulting in loss of efficiency. Since the proposed nonstandard method 
provides an accurate and efficient numerical solution in the advection-reaction 
step of the algorithm, an improvement in the numerical treatment of the dif- 
fusion step becomes crucial to the overall performance of the global numerical 
scheme. 

These and other related issues will be more closely examined and analyzed
in a follow-up article [6].



4 Numerical Results 



In this section we explore the ability of the generalized nonstandard method to
solve the nonlinear advection-diffusion-reaction equation (1) and compare it to
a variety of standard numerical methods.

First, we examine the following advection-reaction equation

$$\frac{\partial c}{\partial t} + v\,\frac{\partial c}{\partial x} = c\left(c - \tfrac{1}{2}\right)(1 - c) \qquad (15)$$

subject to the initial condition $c(x,0) = \tfrac{1}{2} - \tfrac{1}{2}\cos(2\pi x)$. The Lagrangian equation
$dc/dt = c(c - 1/2)(1 - c)$ has an unstable equilibrium at $c = 1/2$ and stable
equilibria at $c = 0, 1$.

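We consider a constant velocity field $v = 0.2$. To solve Equation (15) we use
the generalized nonstandard method (7)-(9) with

$$f(w) = \frac{\left(w - \frac{1}{2}\right)^4}{w^2 (1 - w)^2}, \quad 0 < w < 1, \qquad f^{-1}(w) = \frac{1}{2} \pm \frac{1}{2}\sqrt{\frac{\sqrt{w}}{1 + \sqrt{w}}}, \qquad (16)$$

where the sign in $f^{-1}$ is that of $c - 1/2$ on the branch considered,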



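for $\Delta t = 0.25$ and $\Delta x = 0.01$. Figure 1 compares the numerical solution of (15)
at $t = 2.5$ and $t = 25$ using the generalized nonstandard scheme (7)-(9) to the
solution obtained via the standard (backward) finite difference scheme

$$\frac{C_i^{n+1} - C_i^{n}}{\Delta t} + v\left[\theta\,\frac{C_i^{n+1} - C_{i-1}^{n+1}}{\Delta x} + (1-\theta)\,\frac{C_i^{n} - C_{i-1}^{n}}{\Delta x}\right] = \theta\, r(C_i^{n+1}) + (1-\theta)\, r(C_i^{n})$$

with $\theta = 0$ (explicit), $\theta = 1/2$ (semi-implicit) and $\theta = 1$ (implicit).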

Fig. 1. Numerical solutions of (15) generated by the generalized nonstandard method
(bottom right), and by the standard finite difference scheme with explicit (top left),
semi-implicit (top right) and implicit (bottom left) advection-reaction at times t = 0
(dotted line), t = 2.5 (solid line) and t = 25 (dashed line).

The explicit numerical solution is only shown at $t = 2.5$ since it blows
up shortly after. This is because the Courant number $v\Delta t/\Delta x = 5$ is larger than
one. The implicit scheme is strongly diffusive. The semi-implicit discretization
exhibits a better behavior, which can be somewhat expected from its non-local
modeling of the reaction term. It is interesting to note that no value of $\theta$ will
produce a solution which eventually tends to the exact square wave solution. On
the other hand the generalized nonstandard method provides the correct amount
of local implicitness and leads to the exact solution.
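
The exactness of the reaction update can be checked pointwise, without any advection. The sketch below assumes the reaction term of (15) and the pair $f$, $f^{-1}$ derived from it by partial fractions (the branch flag `upper_branch` and all function names are our own hypothetical choices); it compares ten nonstandard steps of size 0.25 with a finely resolved Runge-Kutta reference for $dc/dt = r(c)$.

```python
import math

def r(c):
    # Reaction term of equation (15).
    return c * (c - 0.5) * (1.0 - c)

def f(c):
    # Satisfies f / f' = r on (0, 1), away from c = 1/2.
    return (c - 0.5) ** 4 / (c ** 2 * (1.0 - c) ** 2)

def f_inv(w, upper_branch):
    # Inverse of f on (1/2, 1) if upper_branch is True, else on (0, 1/2).
    root = 0.5 * math.sqrt(math.sqrt(w) / (1.0 + math.sqrt(w)))
    return 0.5 + root if upper_branch else 0.5 - root

def nonstandard_step(c, dt):
    # Exact update along a characteristic: f(c_new) = exp(dt) * f(c).
    return f_inv(math.exp(dt) * f(c), upper_branch=(c > 0.5))

def rk4_reference(c, t_end, steps):
    # Classical fourth-order Runge-Kutta integration of dc/dt = r(c).
    h = t_end / steps
    for _ in range(steps):
        k1 = r(c)
        k2 = r(c + 0.5 * h * k1)
        k3 = r(c + 0.5 * h * k2)
        k4 = r(c + h * k3)
        c += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return c

c_ns = 0.3
for _ in range(10):          # ten steps of size 0.25, i.e. t = 2.5
    c_ns = nonstandard_step(c_ns, 0.25)
c_ref = rk4_reference(0.3, 2.5, 25000)
```

The two values agree up to the accuracy of the reference integrator, independently of the step size used in `nonstandard_step`.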

We now consider the advection-diffusion-reaction equation

$$\frac{\partial c}{\partial t} + v\,\frac{\partial c}{\partial x} - \frac{\partial}{\partial x}\left(D\,\frac{\partial c}{\partial x}\right) = r(c), \qquad (17)$$

with the same velocity $v = 0.2$ but with a nonzero diffusion coefficient $D = 0.004$.
In the first step of the time-splitting numerical procedure, we implement the
generalized nonstandard method (12)-(13) with $f$ and $f^{-1}$ defined by (16). In the
second step, the implicit finite-difference scheme (14) is applied.
Fig. 2. Numerical solutions of (17) generated by the generalized nonstandard method
(bottom right), and by the standard finite difference scheme with explicit (top left),
semi-implicit (top right) and implicit (bottom left) advection-diffusion-reaction at times
t = 0 (dotted line), t = 2.5 (solid line), t = 25 (dashed line) and t = 250 (dash-dot
line).



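Figure 2 shows the numerical solution generated by the generalized nonstandard
method together with the finite difference approximation given by

$$\frac{C_i^{n+1} - C_i^{n}}{\Delta t} + v\left[\theta\,\frac{C_i^{n+1} - C_{i-1}^{n+1}}{\Delta x} + (1-\theta)\,\frac{C_i^{n} - C_{i-1}^{n}}{\Delta x}\right] = \theta\left(r(C_i^{n+1}) + D^{n+1}\,\frac{C_{i+1}^{n+1} - 2C_i^{n+1} + C_{i-1}^{n+1}}{(\Delta x)^2}\right) + (1-\theta)\left(r(C_i^{n}) + D^{n}\,\frac{C_{i+1}^{n} - 2C_i^{n} + C_{i-1}^{n}}{(\Delta x)^2}\right)$$

with $\theta = 0$, $\theta = 1/2$ and $\theta = 1$ at times $t = 0$, 2.5, 25, and 250.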

The problem is highly advection-dominated, since the grid Peclet number is
$v\Delta x/D = 1/2$, and the explicit standard finite difference scheme yields a very
oscillatory solution. Note that the explicit scheme applied to (17) is more unstable
than in the diffusion-free example (15) because the discretization of the nonlinear
reaction term introduces "negative" diffusion into the scheme. Similarly to
the diffusion-free case the implicit and semi-implicit schemes are somewhat more
diffusive than the nonstandard scheme, where the diffusion almost balances with
the reaction. All stable standard numerical solutions eventually converge to the
wrong "steady state" solution $c = 1$.

5 Conclusions 

A generalized nonstandard numerical method has been developed for solving 
advection-diffusion-reaction equations. The new method suppresses elementary 
numerical instabilities that occur with standard numerical schemes. It leads to 
significant, qualitative improvements in the behavior of the numerical solution. 

We are currently investigating more accurate, symmetrized time-splitting 
algorithms and more accurate numerical solution of the second/diffusive step in 
the time-stepping algorithm. Results will be reported in a follow-up article [6]. 
Abstract. In this paper we study numerically blow-up solutions of elliptic
equations with nonlinear dynamical boundary conditions. First, we
formulate a result for blow-up when a dynamical boundary condition is
posed on a part of the boundary. Next, by semidiscretization, we obtain
a system of ordinary differential equations (ODEs), the solution of
which also blows up. Under certain assumptions we prove that the
numerical blow-up time converges to the corresponding real blow-up time
when the mesh size goes to zero. We investigate numerically the blow-up
set (BUS) and the blow-up rate. Numerical experiments with a local mesh
refinement technique are also discussed.



1 Introduction 

Let $\Omega$ be a bounded convex domain in $R^N$ ($N > 1$) with piecewise smooth
boundary $\partial\Omega = S_1 \cup S_2$, $S_1 \cap S_2 = \emptyset$. We consider the following problem:

$$Lu = -\Delta u = 0 \quad \text{in } \Omega, \qquad (1)$$

$$l_1(u) = \frac{\partial u}{\partial t} + k\,\frac{\partial u}{\partial n} = g(u) \quad \text{on } S_1 \times (0, \infty), \qquad (2)$$

$$l_2(u) = a\,\frac{\partial u}{\partial n} + bu = 0, \quad a \ge 0,\ b \ge 0,\ a + b = 1 \quad \text{on } S_2 \times (0, \infty), \qquad (3)$$

$$u(x, 0) = u_0(x) \quad \text{on } S_1. \qquad (4)$$

Here $k > 0$, $a$, $b$ are constants, $\Delta$ is the Laplace operator with respect to the
space variables and $\partial/\partial n$ is the outward normal derivative on $S_1$ (or $S_2$). Problems of
type (1)-(4) model processes of filtration in hydrology and heat transfer in a
solid in contact with a fluid, see [5, 8] for more information.

We shall consider nonlinearity $g$ for which blow-up occurs in finite time.
Blow-up results for differential problems with dynamical boundary conditions
on the whole boundary start with [8]. Further results were obtained in [1, 2, 5,
6, 7], where also $S_1 = \partial\Omega$.

In the next section we present existence of blow-up solutions for the problem (1)-(4),
when $\Omega$ is a bounded domain and $S_2$ can be a non-empty set. In section 3 we
give some theoretical results concerning the semidiscrete problem. In the last
section we discuss numerical experiments, using uniform mesh and multilevel
mesh refinement techniques in space and the Nakagawa adaptive algorithm in
time [9]. Up to now, there are no analytical results concerning the blow-up set
(BUS) and the blow-up rate of the problem (1)-(4). We investigate these questions
numerically.



2 Blow-Up in the Continuous Problem with $S_1 \ne \partial\Omega$



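The method used in this section to demonstrate blow-up makes use of the
principal eigenfunction $\psi_1$ of the following Steklov spectral problem:

$$\Delta\psi = 0 \quad \text{in } \Omega, \qquad (5)$$

$$k\,\frac{\partial\psi}{\partial n} = \lambda\psi \quad \text{on } S_1, \qquad (6)$$

$$a\,\frac{\partial\psi}{\partial n} + b\psi = 0 \quad \text{on } S_2. \qquad (7)$$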



Lemma 1. We suppose that fl, Si, S 2 , k, a, b are as defined in Section 1. Let 

Ai — ini - 7 . , 



The following assertions hold: 

1) X\ is the smallest non-trivial eigenvalue of the problem ( 5)-(7); 

2) there exists a function ij^i g such that > 0 and 

^ F{'>pi,'tjji) 



( 8 ) 



3) (8) is the necessary and sufficient condition the function tpi^x) G H^{Q) 
to be the eigenfunction of (5)-(7) corresponding to the eigenvalue Ai. 



Theorem 1 We suppose that fl, Si, S 2 , k, a, b are as defined in Lemma 1. Let 
uo(x) > 0, uo(x) ^0, X G L2,uq G C'(S'i) and 



Uo= uo{a)ifi{a)da > <5o > 0- 
Jsi 



We assume in addition that g{v) is convex on and 

g{v) — kXiv > 0 for v > So, 
dv 



j 

Js, 



60 gi^’) - 



< 00. 



Then, the solution of (1)-(4) blows up in finite time.



(9) 



( 10 ) 



( 11 ) 
Proof. The efforts are concentrated to show that $U(t) = \int_{S_1} u(\sigma,t)\,\psi_1(\sigma)\,d\sigma \to \infty$
as $t \to T_b < \infty$.

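Remark 1. The result of Theorem 1 holds when one considers in (1), instead of
the Laplace operator, a more general elliptic operator, like

$$Lu = -\mathrm{div}(k(x)\nabla u) + c(x)u, \quad 0 < k_0 \le k(x) \le k_1,\ c(x) \ge 0, \quad \text{in } \Omega$$

and boundary conditions

$$l_1(u) = \frac{\partial u}{\partial t} + k(x)\,\frac{\partial u}{\partial n} = g(u) \quad \text{on } S_1 \times (0, \infty),$$

$$l_2(u) = k(x)\,\frac{\partial u}{\partial n} + b(x)u = 0, \quad b(x) \ge 0, \quad \text{on } S_2.$$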



3 Blow-Up in the Semidiscrete Problem 

Since our objective in the approximation is studying the problems arising from
blow-up properties of the differential solutions, we will take $\Omega$ to be the square
$(0, 1)^2$. Let for definiteness $S_1 = \{x = (x_1, x_2) \mid 0 \le x_1 \le 1,\ x_2 = 1\}$. We introduce
the uniform mesh in $\bar\Omega$

$$\bar\omega = \{(x_{1i}, x_{2j}) \in \bar\Omega : x_{1i} = (i-1)h_1,\ i = 1, \ldots, N,\ (N-1)h_1 = 1,\ x_{2j} = (j-1)h_2,\ j = 1, \ldots, M,\ (M-1)h_2 = 1\}.$$

Using FEM (linear triangular elements) with mass lumping, we get the ODEs:





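$$\begin{aligned}
& y_{\bar x_1 x_1} + y_{\bar x_2 x_2} = 0, && i = 2, \ldots, N-1,\ j = 2, \ldots, M-1, \\
& \frac{1}{k}\frac{dy}{dt} - \frac{h_2}{2}\, y_{\bar x_1 x_1} + y_{\bar x_2} = \frac{1}{k}\, g(y), && i = 2, \ldots, N-1,\ j = M, \\
& -\frac{a h_1}{2}\, y_{\bar x_2 x_2} - a\, y_{x_1} + b y = 0, && i = 1,\ j = 2, \ldots, M-1, \\
& -\frac{a h_1}{2}\, y_{\bar x_2 x_2} + a\, y_{\bar x_1} + b y = 0, && i = N,\ j = 2, \ldots, M-1, \\
& -\frac{a h_2}{2}\, y_{\bar x_1 x_1} - a\, y_{x_2} + b y = 0, && i = 2, \ldots, N-1,\ j = 1, \\
& -a h\, y_{x_1} - a\, y_{x_2} + b(1 + h)\, y = 0, && i = 1,\ j = 1, \\
& a h\, y_{\bar x_1} - a\, y_{x_2} + b(1 + h)\, y = 0, && i = N,\ j = 1, \\
& \frac{a}{k}\frac{dy}{dt} - a h\, y_{x_1} + a\, y_{\bar x_2} + b h\, y = \frac{a}{k}\, g(y), && i = 1,\ j = M, \\
& \frac{a}{k}\frac{dy}{dt} + a h\, y_{\bar x_1} + a\, y_{\bar x_2} + b h\, y = \frac{a}{k}\, g(y), && i = N,\ j = M, \\
& y(0) = u_0(x), && i = 1, \ldots, N,\ j = M.
\end{aligned} \qquad (12)$$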











The local truncation error is $O(h_1^2 + h_2^2)$. We have denoted with $y(t) = y_{ij}(t) =
y(x_{1i}, x_{2j}, t)$, $i = 1, \ldots, N$, $j = 1, \ldots, M$ the values of the numerical approximation
at the nodes $x = (x_{1i}, x_{2j})$ at time $t$. Also, the usual notations for finite
difference derivatives from [10] are used and $h = h_1/h_2$.

For the scheme (12) we say that the solution $y$ has a finite blow-up time, if
there exists a finite time $T_b^h$ such that

$$\lim_{t \to T_b^h} \|y(t)\|_\infty = \lim_{t \to T_b^h} \max_{i,j} |y_{ij}(t)| = +\infty.$$

Theorem 2 Let the assumptions of Theorem 1 hold. Then the solution of (12)
blows up in a finite time.

Proof. We show that 



Uh{t) = ^ hiyiM{t)'tp’({xi,) + (a^ii) + yNM{t)'ip^{xi^)\ oo 

i^2 

as t — > T() < oo, where '0i is the first normalized spectral function of the discrete 
problem, corresponding to (5)-(7). 

Theorem 3 Let u be a regular solution of (1)- (4) (u G x [0, T(, — r]), r > 

0) and y is the numerieal approximation, given by (12). Then there exists a 
constant C, depending on ||u|| in x [0,Tb — rj) such that 

11''^ ~ y|licx,(r2x[0,T6-r]) — . 

Proof. It makes use of the conception for supersolutions (subsolutions). 

We remark that our convergence rate is not optimal (we have instead of 
h^), because we are only assuming that u G x [0, Tb — r]). 

Now, we can state the result concerning the convergence of the numerical 
blow-up time. 

Theorem 4 Let $u$ be a regular solution of (1)-(4). If $T_b$ and $T_b^h$ are the blow-up
times of problems (1)-(4) and (12) respectively, and $g$, $u_0(x)$ satisfy (9)-(11), then
$T_b^h \to T_b$ ($h \to 0$).

Proof. The proof is based on estimates for $T_b$ and $T_b^h$, obtained in Theorems 1
and 2, respectively.

Full proofs of Lemma 1 and Theorems 1-4 will be given in a forthcoming
paper.

4 Numerical Experiments 

We shall use a full discretization of (12), where the time step for the 

{n l)-th level is chosen as in [9]: chose tq, then t'^+^ = tq x min{l, }, 

n = 0,1,2,... (tq < {hi h.2)/4 for explicit scheme). The computations in the 
examples below are performed with an implicit-explicit scheme, i.e. the differential
operator is approximated by implicit scheme, while the nonlinear term g{u) = 

4.1 Local grid refinement techniques are effective in resolving the local physical
behaviour of the solution. As the blow-up phenomenon occurs only on the
boundary with the dynamical nonlinear boundary condition, the fine mesh is not
needed in the whole domain, where the solution has moderate variation [3]. We
use a grid with local mesh refinement, shown on Figure 1a, variable in time.





Fig. 1. Numerical blow-up solutions 



The sizes of coarse grid cells are $h_{1c}$, $h_{2c}$ and the sizes of fine grid mesh are $h_{1f}$,
$h_{2f}$; $h_{1c} = (2m + 1)h_{1f}$, $h_{2c} = (2n + 1)h_{2f}$, $m, n \ge 0$ integer-valued. The grid
points are the centres of the cells. By $qq$ we denote the number of coarse lines
covered with fine mesh. On each time level, we follow the local mesh refinement
method proposed in [3]. Let

uo(a;i) = 20(cos87r(a;i - i) + 2), g{u)=u'^, g = 5, 
a = 0, 6=1, To = 0.05, k=l. 

We examine our algorithm for the specific example: initial function with $P$
($P = 4$) equal maxima. Let $y_{k_l M}$, $l = 1, 2, \ldots, P$, be the peaks of the numerical
solution at the mesh points $x_{k_l M}$, $x_{k_l M} \in S_1$. Our basic approach is as follows:

(1) Start with mesh refinement: $qq := s_1$, $m := s_2$, $n := s_3$ ($s_1 \ge 1$, $s_i \in Z$).

(2) Compute the numerical solution.

(3) If $|y_{k_p M} - y_{k_l M}| > \varepsilon$ ($\varepsilon$ = 1.e-11 in our example), $p \ne l$, $p, l = 1, \ldots, P$,
then set $h_{1f} := h_{1f}/(2m + 1)$, $h_{2f} := h_{2f}/(2n + 1)$, $\varepsilon := \varepsilon$ + tolerance, where
tolerance is not a small number. Recompute the solution.

(4) If $|y_{k_p M} - y_{k_l M}| \le \varepsilon$, then set $t := t + \tau$ ($\tau$ is chosen as in [9]).

(5) Repeat from (2).

On Figure 1b we show the numerical solution of (1)-(4), computed with
uniform mesh $N = M = 41$.

On Figure 1c,d we display the numerical solution of (1)-(4), using the mesh
refinement technique, variable in time, starting with $N_c = M_c = 21$, $qq = 3$,
$m = n = 1$.

4.2 The examples above show that the solution of (1)-(4) blows up on $S_1$ in
the point(s) in which the initial function attains its maximum value(s). Therefore,
in the case when the initial data have maximum value only in (or near) the
north-east (north-west or both) corner(s) we could apply a local mesh refinement
technique only in this corner(s). In addition to the notation in 4.1, we denote by
$qq_{1(2)}$ the number of rows (columns) of coarse mesh. The domain obtained from
the intersection of $qq_1$ and $qq_2$ coarse rows and columns we cover with fine mesh
(Fig. 2a), as in [3]. Let

Mo(xi) = sin(^) -h 1, g(u)=u«, 
g = 5, a = b = 0.5, , tq = 0.5, k = 1. 

On Figure 2b, 2c the numerical solutions are plotted on the boundary $S_1$.
Although the initial function has its maximum at the point $x_1 = 1$ (where the
analytical blow-up is expected), the numerical solution computed with uniform
mesh blows up on $S_1$ near this point (see Figure 2b, $T_b^h = 0.03522353098749$,
$N = 91$, $M = 31$). Similar results for other problems are discussed in [4]. In
order to approach the original blow-up point, we need an adaptive algorithm.
To this end, we will use the algorithm described in 4.1, but the conditions in
steps (3) and (4) are evaluated with the distance between the point $x_1 = 1$ and
the mesh point in which the solution attains its maximum value.

On Figure 2c we show the numerical solution of (1)-(4), using the local mesh
refinement technique, variable in time, starting with $qq_1 = 3$, $qq_2 = 2$, $m = 4$,
$n = 1$, $h_{1c} = h_{2c} = 0.1$, $h_{1f} \approx 0.011$, $h_{2f} \approx 0.033$, $T_b^h = 0.03466432789105$.
Fig. 2. Numerical blow-up on the boundary. (a) Local mesh refinement, m = n = 1, qq_1 = qq_2 = 2; (b); (c).

Fig. 3. Blow-up rate. (a); (b).

4.3 It is known [6] that for elliptic and parabolic problems with nonlinear
dynamical boundary condition (2), $g(u) = u^q$, $q > 1$, $k = 1$ on the whole $\partial\Omega$
($S_1 = \partial\Omega$) the blow-up rate is $(T_b - t)^{-\frac{1}{q-1}}$ and $(T_b - t)^{-\frac{1}{2(q-1)}}$ respectively.

Here we check the blow-up rate for problem (1)-(4).

On Figure 3a, 3b we show the graphs of $\ln(y(x_b, 1, t))$ versus $\ln(T_b^h - t)$, where
$(x_b, 1)$ is the blow-up point. The slope of the obtained curves measures the
blow-up rate. We compare the blow-up rate of the numerical solution computed with
uniform mesh (the case from Figure 2b) with the one computed with multilevel local mesh
refinement techniques (the case from Figure 2c).
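
The slope measurement described above can be sketched as a least-squares fit in log-log coordinates. The data below are synthetic (an exact power law with $q = 5$, chosen by us for illustration) and are not the paper's numerical solution; the function name `blow_up_rate` is hypothetical.

```python
import numpy as np

def blow_up_rate(t, y, T_b):
    """Least-squares slope of ln(y) against ln(T_b - t)."""
    slope, _intercept = np.polyfit(np.log(T_b - t), np.log(y), 1)
    return slope

# Synthetic test signal: y(t) = (T_b - t)^(-1/(q-1)) with q = 5.
q = 5
T_b = 0.035
t = np.linspace(0.0, 0.0349, 200)
y = (T_b - t) ** (-1.0 / (q - 1))
rate = blow_up_rate(t, y, T_b)
```

For this synthetic signal the fitted slope is $-1/(q-1) = -0.25$ up to rounding; for computed solutions the fit would normally be restricted to times close to the numerical blow-up time.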

Remark 2. The difference in the BUS and in the asymptotic behaviour between
solutions of the differential equations and their numerical approximations has
been observed by several authors in recent years, e.g. [4]. The expected (but not
proved up to now) blow-up rate in 4.3 is 0.25.
5 Conclusions

The paper presents theoretical results for blow-up of continuous and semidiscrete
solutions to elliptic equations with a nonlinear dynamical condition on part of
the domain boundary. For computation of the blow-up solutions we have
implemented, in the space discretization, the local mesh refinement technique proposed
in [3]. An algorithm for control of the local refinement grid in time, which
takes into account the growth of the numerical solution, is proposed. The
experiments show that this algorithm provides a significant improvement in the
computation of the single point blow-up solutions (compare Figure 1b with Figure 1d).
In order to construct a more efficient adaptive algorithm we need to have a good
knowledge of the BUS and estimates for the blow-up rate, see for example [4],
where a parabolic problem with zero Dirichlet boundary conditions is studied.
For our problem there are no such analytical results.
Abstract. A new effective method and its two modifications for solving
Hermitian pentadiagonal block circulant systems of linear equations
are proposed. New algorithms based on the proposed method are
constructed. Our algorithms are then compared with some classical
techniques with respect to execution time, number of operations
and storage. Numerical experiments corroborating the effectiveness of
the proposed algorithms are also reported.

1 Introduction 

Linear systems of equations having circulant coefficient matrices appear in many
applications, for example in the finite difference approximation to elliptic
equations subject to periodic boundary conditions [2,8] and in approximating
periodic functions using splines [1,9]. When multidimensional problems are
concerned, the matrices of coefficients of the resulting linear systems are block
circulant matrices [7].

In this paper we propose a new method and its two modifications for solving
Hermitian pentadiagonal block circulant systems of linear equations. It is known
that these systems have the form

$$W x = f, \qquad (1)$$

where

$$W = \begin{pmatrix}
M & N & S & & & S^* & N^* \\
N^* & M & N & S & & & S^* \\
S^* & N^* & \ddots & \ddots & \ddots & & \\
 & \ddots & \ddots & \ddots & \ddots & \ddots & \\
 & & \ddots & \ddots & \ddots & N & S \\
S & & & S^* & N^* & M & N \\
N & S & & & S^* & N^* & M
\end{pmatrix} \qquad (2)$$

is a Hermitian pentadiagonal block circulant matrix with block size $n$. $M$, $N$
and $S$ are $m \times m$ matrices, $x = \{x_i\}_{i=1,\ldots,n}$ and $f = \{f_i\}_{i=1,\ldots,n}$ are column vectors
with block size $n$, $x_i$ and $f_i$ are blocks with size $m \times 1$.

Our goal is to construct a new effective method for solving of (1) and to 
compare it with some classical techniques. 

The paper is organized as follows: in section 2 we present the new method 
and discuss its two modifications based on different approaches for using the 
Woodbury’s formula [4]; in section 3 we report some numerical experiments 
corroborating the effectiveness of the proposed algorithms. 



2 A Modification of LU Factorizations 



Adapting the ideas suggested in [6], we construct a new method for solving linear
systems with coefficient matrices of the form (2). Our approach is based on the
solution of a special nonlinear matrix equation. One can find the solution of (1)
using the following steps:

Step 1. Solve the parametric linear system

$$T y = f, \qquad (3)$$

where

$$T = \begin{pmatrix}
X & Y & S & & & \\
Y^* & Z & N & \ddots & & \\
S^* & N^* & M & \ddots & \ddots & \\
 & \ddots & \ddots & \ddots & \ddots & S \\
 & & \ddots & \ddots & \ddots & N \\
 & & & S^* & N^* & M
\end{pmatrix}$$

is a pentadiagonal matrix with block size $n$. It has a block Toeplitz structure
except for the north-western corner, $y = \{y_i\}_{i=1,\ldots,n}$ and $f = \{f_i\}_{i=1,\ldots,n}$ are
column vectors with block size $n$, $y_i$ and $f_i$ are blocks with size $m \times 1$.

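Matrix $T$ admits the following LU factorization

$$T = LU = \begin{pmatrix}
I_m & & & 0 \\
Y^*X^{-1} & \ddots & & \\
S^*X^{-1} & \ddots & \ddots & \\
0 & S^*X^{-1} & Y^*X^{-1} & I_m
\end{pmatrix}
\begin{pmatrix}
X & Y & S & 0 \\
 & \ddots & \ddots & S \\
 & & \ddots & Y \\
0 & & & X
\end{pmatrix},$$

where $I_m$ is the identity matrix with size $m \times m$.

The above decomposition exists when the parameters $X = X^*$, $Y$ and $Z = Z^*$
satisfy the relations

$$\begin{aligned}
Z &= Y^*X^{-1}Y + X, \\
N &= Y^*X^{-1}S + Y, \\
M &= S^*X^{-1}S + Z.
\end{aligned} \qquad (4)$$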

Let us introduce the following notations

$$F = \begin{pmatrix} X & Y \\ Y^* & Z \end{pmatrix}, \qquad
Q = \begin{pmatrix} S & 0 \\ N & S \end{pmatrix}, \qquad
R = \begin{pmatrix} M & N \\ N^* & M \end{pmatrix}. \qquad (5)$$

If $F$ is a positive definite solution of the matrix equation

$$F + Q^*F^{-1}Q = R \qquad (6)$$

and $X = X^* > 0$, $Z = Z^* > 0$, then the blocks $X$, $Y$ and $Z$ satisfy the system
(4).
Thus, solving the linear system (3) is equivalent to solving two simpler systems

$$L z = f, \quad z = \{z_i\}_{i=1,\ldots,n},$$

$$U y = z, \quad y = \{y_i\}_{i=1,\ldots,n}.$$

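Step 2. Solve the pentadiagonal block Toeplitz linear system

$$P u = f, \qquad (7)$$

where

$$P = \begin{pmatrix}
M & N & S & & & \\
N^* & M & N & \ddots & & \\
S^* & N^* & M & \ddots & \ddots & \\
 & \ddots & \ddots & \ddots & \ddots & S \\
 & & \ddots & \ddots & \ddots & N \\
 & & & S^* & N^* & M
\end{pmatrix} \qquad (8)$$

is a Hermitian pentadiagonal block Toeplitz matrix with block size $n$, $u =
\{u_i\}_{i=1,\ldots,n}$ and $f = \{f_i\}_{i=1,\ldots,n}$ are column vectors with block size $n$, $u_i$ and
$f_i$ are blocks with size $m \times 1$.

The matrices $T$ and $P$ satisfy the relation $P = T + J_2 V$, where

$$J_2 = \begin{pmatrix} I_m & 0 & 0 & \ldots & 0 \\ 0 & I_m & 0 & \ldots & 0 \end{pmatrix}^T, \qquad
V = \begin{pmatrix} M - X & N - Y & 0 & \ldots & 0 \\ N^* - Y^* & M - Z & 0 & \ldots & 0 \end{pmatrix}$$

are matrices with block size $n \times 2$ and $2 \times n$ respectively.
Using Woodbury's formula we have

$$P^{-1} = T^{-1} - T^{-1}J_2\left[I_{2m} + VT^{-1}J_2\right]^{-1}VT^{-1}, \qquad (9)$$

where $I_{2m}$ is the identity matrix with size $2m \times 2m$.
For the solution $u$ of (7) we have

$$u = P^{-1}f = y - T^{-1}J_2\left[I_{2m} + VT^{-1}J_2\right]^{-1}Vy.$$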
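
The low-rank update above can be sketched generically: given any solver for $T$, a system with $T + JV$ is solved with one extra small dense solve. The code below is a minimal illustration with made-up dense test matrices; the helper `woodbury_solve` and its argument names are hypothetical, and a dense `numpy` solve stands in for the banded LU solver of Step 1.

```python
import numpy as np

def woodbury_solve(solve_T, J, V, f):
    """Solve (T + J V) u = f, given a function that solves systems with T."""
    y = solve_T(f)                       # y = T^{-1} f
    TinvJ = solve_T(J)                   # T^{-1} J, one column per solve
    small = np.eye(J.shape[1]) + V @ TinvJ
    return y - TinvJ @ np.linalg.solve(small, V @ y)

# Made-up test data: the correction touches only the leading 4 x 4 corner.
rng = np.random.default_rng(0)
size, rank = 12, 4
T = 5.0 * np.eye(size) + 0.1 * rng.standard_normal((size, size))
J = np.zeros((size, rank))
J[:rank, :rank] = np.eye(rank)
V = np.zeros((rank, size))
V[:, :rank] = 0.3 * rng.standard_normal((rank, rank))
f = rng.standard_normal(size)

u = woodbury_solve(lambda b: np.linalg.solve(T, b), J, V, f)
```

Only the small matrix of size `rank` is inverted in addition to the solves with `T`, which is the point of using the formula.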



One can find the matrix $T^{-1}J_2$ by solving $2m$ linear systems of type (3) with
right-hand sides corresponding to the different columns of $J_2$. This approach
does not take into consideration the very sparse nonzero structure of $J_2$. For
real $M$, $N$ and $S$ it costs $O(20nm^3)$ flops and needs to store $2nm^2$ real numbers.
In order to decrease the number of operations for computing $T^{-1}J_2$ we propose
a new approach which is based on the ideas suggested in [5].

Let us denote the block column vectors of $J_2$ with $E_1$ and $E_2$ respectively, i.e.

$$E_1 = (I_m\ 0\ 0\ \ldots\ 0)^T, \qquad E_2 = (0\ I_m\ 0\ \ldots\ 0)^T.$$

Put $A = Y^*X^{-1}$ and $B = S^*X^{-1}$.

The matrix $T$ admits the following decomposition

$$T = LDL^*, \quad \text{where} \quad L = \begin{pmatrix}
I_m & & & \\
A & \ddots & & \\
B & \ddots & \ddots & \\
0 & B & A & I_m
\end{pmatrix}$$

is a square matrix of block size $n$ and $D = \mathrm{diag}(X, \ldots, X)$.
Let $L^{-1}_{ij}$ be the blocks of the matrix $L^{-1}$. We have

$$L^{-1}_{ij} = \begin{cases} 0, & i < j, \\ Z_{i-j+1}, & i \ge j, \end{cases}$$

where

$$Z_1 = I_m, \quad Z_2 = -A, \quad Z_3 = A^2 - B, \quad Z_j = -AZ_{j-1} - BZ_{j-2} \ \text{ for } j = 4, \ldots, n.$$

Obviously $D^{-1} = \mathrm{diag}(X^{-1}, \ldots, X^{-1})$.

One can compute $T^{-1}J_2$ by consecutive calculation of $T^{-1}E_1$ and $T^{-1}E_2$ using
the following algorithm:

Algorithm RP (Recursive computation for Pentadiagonal systems)

- Find the cells $K_i = X^{-1}Z_i$ for $i = 1, \ldots, n$ by the formulas

$$K_1 = X^{-1}, \quad K_2 = -X^{-1}A, \quad K_i = -K_{i-1}A - K_{i-2}B \ \text{ for } i = 3, \ldots, n.$$

- Compute the blocks $(T^{-1}E_1)_i$ and $(T^{-1}E_2)_i$ by the formulas

$$\begin{aligned}
(T^{-1}E_1)_n &= K_n, \\
(T^{-1}E_1)_{n-1} &= K_{n-1} - A^*(T^{-1}E_1)_n, \\
(T^{-1}E_1)_i &= K_i - A^*(T^{-1}E_1)_{i+1} - B^*(T^{-1}E_1)_{i+2} \ \text{ for } i = n-2, \ldots, 2, \\
(T^{-1}E_1)_1 &= X^{-1} - A^*(T^{-1}E_1)_2 - B^*(T^{-1}E_1)_3, \\
(T^{-1}E_2)_n &= K_{n-1}, \\
(T^{-1}E_2)_{n-1} &= K_{n-2} - A^*(T^{-1}E_2)_n, \\
(T^{-1}E_2)_i &= K_{i-1} - A^*(T^{-1}E_2)_{i+1} - B^*(T^{-1}E_2)_{i+2} \ \text{ for } i = n-2, \ldots, 2, \\
(T^{-1}E_2)_1 &= -A^*(T^{-1}E_2)_2 - B^*(T^{-1}E_2)_3.
\end{aligned}$$

If the blocks $M$, $N$ and $S$ are real, the algorithm RP costs $O(12nm^3)$ flops and
needs to store $(3n + 2)m^2$ real numbers. Based on the above algorithm, in
the next step we consider two different approaches for solving (1).
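
The recursions of Algorithm RP can be sketched as follows. This is an illustrative dense implementation under our own assumptions: the test blocks $X$, $Y$, $S$ are made up, and the helper `build_T` assembles $T = LDL^*$ explicitly only so that the result can be compared with a direct inverse. The names `algorithm_rp` and `build_T` are hypothetical.

```python
import numpy as np

def algorithm_rp(X, Y, S, n):
    """Return T^{-1}E_1 and T^{-1}E_2 as (n*m) x m arrays, for n >= 2."""
    m = X.shape[0]
    Xinv = np.linalg.inv(X)
    A = Y.conj().T @ Xinv
    B = S.conj().T @ Xinv
    # K[i] stores K_{i+1} = X^{-1} Z_{i+1} (0-based list, 1-based formulas).
    K = [Xinv, -Xinv @ A]
    for i in range(2, n):
        K.append(-K[i - 1] @ A - K[i - 2] @ B)

    def back_substitute(rhs):
        # Solve L^* w = rhs block by block, from the last block upwards.
        w = [None] * n
        for i in range(n - 1, -1, -1):
            w[i] = rhs[i].copy()
            if i + 1 < n:
                w[i] -= A.conj().T @ w[i + 1]
            if i + 2 < n:
                w[i] -= B.conj().T @ w[i + 2]
        return np.vstack(w)

    col1 = back_substitute(K[:n])                              # D^{-1} L^{-1} E_1
    col2 = back_substitute([np.zeros((m, m))] + K[: n - 1])    # D^{-1} L^{-1} E_2
    return col1, col2

def build_T(X, Y, S, n):
    """Assemble T = L D L^* densely (used here for checking only)."""
    m = X.shape[0]
    Xinv = np.linalg.inv(X)
    A = Y.conj().T @ Xinv
    B = S.conj().T @ Xinv
    L = np.eye(n * m, dtype=X.dtype)
    for i in range(1, n):
        L[i * m:(i + 1) * m, (i - 1) * m:i * m] = A
        if i >= 2:
            L[i * m:(i + 1) * m, (i - 2) * m:(i - 1) * m] = B
    D = np.kron(np.eye(n), X)
    return L @ D @ L.conj().T

# Made-up test blocks with X symmetric positive definite.
n, m = 8, 2
X = np.array([[4.0, 0.5], [0.5, 3.0]])
Y = np.array([[1.0, 0.2], [0.0, 0.8]])
S = np.array([[0.5, 0.0], [0.1, 0.4]])
T = build_T(X, Y, S, n)
col1, col2 = algorithm_rp(X, Y, S, n)
```

The two returned arrays are the first two block columns of $T^{-1}$, obtained without factorizing or inverting the full matrix.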
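Step 3. Solve the system (1)

3.1 The matrix $W$ satisfies the relation

$$W = P + UV,$$

where

$$U = \begin{pmatrix}
I_m & 0 & 0 & 0 \\
0 & I_m & 0 & 0 \\
0 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 \\
0 & 0 & I_m & 0 \\
0 & 0 & 0 & I_m
\end{pmatrix}, \qquad
V = \begin{pmatrix}
0 & 0 & \ldots & 0 & S^* & N^* \\
0 & 0 & \ldots & 0 & 0 & S^* \\
S & 0 & \ldots & 0 & 0 & 0 \\
N & S & \ldots & 0 & 0 & 0
\end{pmatrix}$$

are the matrices with block size $n \times 4$ and $4 \times n$ respectively.
Using Woodbury's formula we have

$$W^{-1} = P^{-1} - P^{-1}U\left[I_{4m} + VP^{-1}U\right]^{-1}VP^{-1},$$

where $I_{4m}$ is the identity matrix with size $4m \times 4m$.

The solution $x$ of (1) is obtained from the vector $u$ as follows

$$x = W^{-1}f = u - P^{-1}U\left[I_{4m} + VP^{-1}U\right]^{-1}Vu.$$

Denote the block column vectors of $U$ with $E_1$, $E_2$, $E_{n-1}$ and $E_n$ respectively.
Computation of $P^{-1}U$ can be done by consecutive calculations of $P^{-1}E_1$,
$P^{-1}E_2$, $P^{-1}E_{n-1}$ and $P^{-1}E_n$ using formula (9), with $V$ there denoting the matrix from Step 2. For $i = 1, 2, n-1, n$ we
have

$$P^{-1}E_i = T^{-1}E_i - T^{-1}J_2\left[I_{2m} + VT^{-1}J_2\right]^{-1}VT^{-1}E_i. \qquad (10)$$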

Numerical implementation of formulas (10) is very "cheap" since we already
know from Step 2 the elements $T^{-1}J_2$ and $\left[I_{2m} + VT^{-1}J_2\right]^{-1}$. We recommend
formulas (10) instead of solving $4m$ linear systems of the form (7) with right-hand
sides the corresponding column vectors of $U$. It is easy to observe that the blocks
of $T^{-1}E_{n-1}$ and $T^{-1}E_n$ satisfy the relations



(T-1f;„_i)„ = K*, 

(T-1t;„_i), = for * = 

{T~'^En)i = for i = n,...,l, 



where Ki for i = 1, . . . , n are the blocks from algorithm RP. 

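3.2 In order to decrease the size of the inverse matrix in Woodbury's formula
we propose the following decomposition of matrix $W$

$$W = \begin{pmatrix} P & V \\ V^* & R \end{pmatrix},$$

where $P$ is from (8), with block size $(n-2) \times (n-2)$, $R$ is from (5) and

$$V^* = \begin{pmatrix} S & 0 & \ldots & 0 & S^* & N^* \\ N & S & 0 & \ldots & 0 & S^* \end{pmatrix}$$

is a matrix with block size $2 \times (n-2)$.

Put

$$\bar x = (x_1, \ldots, x_{n-2})^T, \quad \tilde x = (x_{n-1}\ x_n)^T, \quad
\bar f = (f_1, \ldots, f_{n-2})^T, \quad \tilde f = (f_{n-1}\ f_n)^T.$$

In this notation system (1) can be written in the form

$$\begin{pmatrix} P & V \\ V^* & R \end{pmatrix}\begin{pmatrix} \bar x \\ \tilde x \end{pmatrix} = \begin{pmatrix} \bar f \\ \tilde f \end{pmatrix},$$

which is equivalent to

$$G\bar x = r, \qquad \tilde x = R^{-1}\left(\tilde f - V^*\bar x\right),$$

where $G = P - VR^{-1}V^*$, $r = \bar f - VR^{-1}\tilde f$.
By Woodbury's formula we have

$$G^{-1} = P^{-1} + P^{-1}V\left[R - V^*P^{-1}V\right]^{-1}V^*P^{-1}.$$

Then

$$\bar x = G^{-1}r = z + P^{-1}V\left[R - V^*P^{-1}V\right]^{-1}V^*z,$$

where $z = P^{-1}r$ can be computed by means of Step 2.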
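
The block elimination behind this step can be sketched with dense matrices. The example below forms the Schur complement $G$ explicitly instead of applying Woodbury's formula, purely to show the order of operations; the test matrices are made up and the function name `solve_bordered` is hypothetical.

```python
import numpy as np

def solve_bordered(P, V, R, f_bar, f_tilde):
    """Solve [[P, V], [V^*, R]] [x_bar; x_tilde] = [f_bar; f_tilde]."""
    Vh = V.conj().T
    G = P - V @ np.linalg.solve(R, Vh)             # G = P - V R^{-1} V^*
    r = f_bar - V @ np.linalg.solve(R, f_tilde)    # r = f_bar - V R^{-1} f_tilde
    x_bar = np.linalg.solve(G, r)
    x_tilde = np.linalg.solve(R, f_tilde - Vh @ x_bar)
    return x_bar, x_tilde

# Made-up symmetric, diagonally dominant test data.
rng = np.random.default_rng(1)
p, q = 10, 4
P = 6.0 * np.eye(p) + 0.1 * np.ones((p, p))
R = 6.0 * np.eye(q) + 0.1 * np.ones((q, q))
V = 0.2 * rng.standard_normal((p, q))
f_bar = rng.standard_normal(p)
f_tilde = rng.standard_normal(q)

x_bar, x_tilde = solve_bordered(P, V, R, f_bar, f_tilde)
W = np.block([[P, V], [V.T, R]])
x_direct = np.linalg.solve(W, np.concatenate([f_bar, f_tilde]))
```

In the method described in the text, $G$ is not formed; its inverse is applied through solves with $P$ and one small $2m \times 2m$ system.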

Denote the block column vectors of $V$ with $H_1$ and $H_2$ respectively. Computation
of $P^{-1}V$ can be done by consecutive calculations of $P^{-1}H_1$ and $P^{-1}H_2$
using formula (9). For $i = 1, 2$

$$P^{-1}H_i = T^{-1}H_i - T^{-1}J_2\left[I_{2m} + VT^{-1}J_2\right]^{-1}VT^{-1}H_i,$$

where the matrix $V$ inside the brackets is the one from Step 2.
Numerical implementation of the last formulas is again "cheap" since we already
know from Step 2 the elements $T^{-1}J_2$ and $\left[I_{2m} + VT^{-1}J_2\right]^{-1}$. The blocks of
$T^{-1}H_1$ and $T^{-1}H_2$ satisfy the relations



(T-iiSi), = (T-iPi),S*+iF:_ 2 _,S + iF:_,_iQ /or i = l,...,n-3 
(T-iiii)„_2 = {T-^Ei)r,-2S* + K*iQ 

{T-^H2)^ = {T~^Ei),N* + {T-^E2)^S* + Kl_i_^S fori = l,...,n- 2 , 



where Q = N — AS and Ki for i = 1, . . . ,n are the blocks from algorithm RP. 
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3 Numerical Experiments 

In this section we compare our algorithms with some classical techniques for 
solving (1), with W given as in (2), and the exact solution a; = (1, 1, , 1)^. 

In our numerical experiments, W is Hermitian Pentadiagonal Block Circu- 
lant, with several block size n. The algorithms are compared by means of exe- 
cution times and accuracy of the solution. 

The codes are written in MATLAB and the computations were done on
an AMD computer. The results of the experiments for each example are given in
Table 1.

The following notation is used: LU stands for the classical LU factorization;
CHOL stands for the classical Cholesky factorization; M_RP(4m) stands for the
algorithm based on the proposed new method using step 3.1; M_RP(2m) stands
for the algorithm based on the proposed new method using step 3.2;
Err. = $\|x - \hat{x}\|_\infty$, where $\hat{x}$ is the computed solution.

To solve system (1) we need to compute a positive definite solution of
the matrix equation (6). A sufficient condition for the existence of a positive
definite solution is $\|R^{-1/2} Q R^{-1/2}\| < \frac{1}{2}$ (see [3]). In the next two
examples, the blocks of the matrix $W$ that form the matrices $R$ and $Q$ satisfy this condition.
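
A check of this kind can be sketched in a few lines of NumPy. The function name `sufficient_condition`, the use of the spectral norm, the bound 1/2 and the toy matrices below are assumptions made for illustration. They are not taken from the examples, and how $R$ and $Q$ are formed from the blocks of $W$ is defined elsewhere in the paper.

```python
import numpy as np

def sufficient_condition(R, Q):
    """Return (value, ok) for the test ||R^{-1/2} Q R^{-1/2}||_2 < 1/2.

    R must be Hermitian positive definite. The spectral norm is an
    assumption; the text does not state which norm is meant.
    """
    w, U = np.linalg.eigh(R)  # R = U diag(w) U*
    if np.min(w) <= 0:
        raise ValueError("R must be positive definite")
    # R^{-1/2} = U diag(w^{-1/2}) U*
    R_inv_sqrt = (U / np.sqrt(w)) @ U.conj().T
    value = np.linalg.norm(R_inv_sqrt @ Q @ R_inv_sqrt, 2)
    return value, bool(value < 0.5)

# Toy data (hypothetical, not the matrices of Example 1 or Example 2)
R = 4.0 * np.eye(3)
Q_small = np.eye(3)
Q_large = 3.0 * np.eye(3)

val_small, ok_small = sufficient_condition(R, Q_small)
val_large, ok_large = sufficient_condition(R, Q_large)
print(val_small, ok_small)
print(val_large, ok_large)
```

The condition is only sufficient, so a failed check does not by itself show that no positive definite solution exists.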

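Example 1. Let

\[ M = \begin{pmatrix} 8 & 1-i & 1.5 \\ 1+i & 9 & 1 \\ 1.5 & 1 & 8 \end{pmatrix}, \quad
   N = \begin{pmatrix} \text{[illegible]} & 1 & 0 \\ 0 & 2 & 0 \\ 1-i & 0 & 0 \end{pmatrix}, \]

\[ S = \begin{pmatrix} 1.2-3i & -0.3 & 0.1 \\ -0.30 & 2.1 & 0.2 \\ 0.1 & 0.2 & 0.65+2i \end{pmatrix}. \]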

The execution time (in seconds) and the error of each algorithm for
different values of $n$ are presented in Table 1.

Example 2. Let 

$M = \mathrm{circ}(22, -8, 1, \dots, 1, -8)$, $N = \mathrm{circ}(-7.2, 1.8, \dots, 1.8)$
are circulant matrices and $S = I$.
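
For concreteness, the blocks of Example 2 can be generated as in the following NumPy sketch. The helper `circ` and its indexing convention (entry $(i, j)$ equal to element $(j - i) \bmod m$ of the first row) are assumptions. Both matrices here are symmetric, so the alternative first-column convention gives the same result. The block size $m = 7$ is the value reported for this example in Table 1.

```python
import numpy as np

def circ(first_row):
    """Circulant matrix whose (i, j) entry is first_row[(j - i) mod m]."""
    c = np.asarray(first_row, dtype=float)
    m = c.size
    i, j = np.indices((m, m))
    return c[(j - i) % m]

m = 7  # block size of Example 2
M = circ([22, -8] + [1] * (m - 3) + [-8])   # circ(22, -8, 1, ..., 1, -8)
N = circ([-7.2] + [1.8] * (m - 1))          # circ(-7.2, 1.8, ..., 1.8)
S = np.eye(m)

print(M[0])  # first row of M
print(N[0])  # first row of N
```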

4 Conclusions 

The proposed new algorithms M_RP(2m) and M_RP(4m) are faster than the
classical LU and CHOL in all reported experiments (Table 1). Theoretical
investigation and numerical experiments suggest that algorithm M_RP(2m) is the
most suitable for implementation. This is due to the size of the inverse matrix
in Woodbury’s formula: the inverse matrix in M_RP(2m) is half the size of the
inverse matrix in M_RP(4m). This leads to a decrease of the execution time, for
instance from 5.02 s to 4.34 s in Example 1 with n = 8000.

Table 1. Execution time (in seconds) and errors

              n = 4000                n = 6000                n = 8000
Algorithm     Err.          time (s)  Err.          time (s)  Err.          time (s)

Example 1, m = 3
LU            1.4482e-015   2.95      1.4482e-015   6.04      1.4482e-015   10.15
CHOL          2.2888e-015   2.00      2.2888e-015   3.23      2.2888e-015   5.44
M_RP(4m)      4.2635e-014   1.95      4.1064e-014   3.14      4.7126e-014   5.02
M_RP(2m)      3.3956e-014   1.63      4.1081e-014   2.62      6.0280e-014   4.34

Example 2, m = 7
LU            1.7764e-015   3.49      1.7764e-015   6.66      1.7764e-015   11.01
CHOL          1.9984e-015   2.39      1.9984e-015   3.90      1.9984e-015   5.98
M_RP(4m)      2.6182e-014   2.14      3.2072e-014   3.40      3.7038e-014   4.88
M_RP(2m)      2.5537e-014   1.87      3.1282e-014   2.83      3.6124e-014   4.47

The complexity of the proposed new algorithms is $O(nm^3)$. For comparison,
an algorithm based on the Fast Fourier Transform (FFT) for solving a block circulant
system with circulant blocks has complexity $O(nm \log(nm))$, but it can be
implemented only when the block size of the matrix $W$ is a power of two and when
all blocks of $W$ are circulant [2]. Our method does not have these restrictions.
The only restriction on the applicability of our method comes from the condition
for the existence of a solution of the matrix equation (6).